A data science and machine learning project for analyzing financial behavior and predicting credit default risk and credit scores using statistical inference, dimensionality reduction, clustering, and supervised learning models.
This project presents a full end-to-end credit risk analysis pipeline, combining classical statistics and modern machine learning techniques to model customer financial behavior and predict credit outcomes.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import prince
from xgboost import XGBClassifier, XGBRegressor
from sklearn.preprocessing import PowerTransformer, StandardScaler, RobustScaler
from sklearn.cross_decomposition import CCA
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import roc_auc_score, mean_squared_error, r2_score, roc_curve, auc, accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix, mean_absolute_error
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Ridge
from imblearn.over_sampling import SMOTE
from scipy.stats import norm, t, chi2, chi2_contingency, kstest, skew
# Directory for saving figures
from pathlib import Path
figure_dir = Path("figure")
figure_dir.mkdir(exist_ok=True)
The dataset used in this study is publicly available on Kaggle under the name “Credit Score”. It contains 1,000 rows and 87 columns (a customer identifier, 84 features, and two targets), giving comprehensive financial and behavioral information on each customer. The features provide a multidimensional view of an individual’s financial situation, spending behavior, and potential credit risk. By combining raw financial measures, behavioral ratios, and categorical attributes, the dataset supports the development of robust and interpretable predictive models for analyzing credit risk and default.
Key Target Variables:
Description of Features:
Transaction groups (GROCERIES, CLOTHING, HOUSING, EDUCATION, HEALTH, TRAVEL, ENTERTAINMENT, GAMBLING, UTILITIES, TAX, FINES) are categorized.
T_[GROUP]_6: Total expenditure in that group over the last 6 months
T_[GROUP]_12: Total expenditure in that group over the last 12 months
R_[GROUP]_DEBT: Ratio of T_[GROUP]_12 to DEBT
Categorical Features:
CAT_GAMBLING: Gambling category (none, low, high)
Summary:
df = pd.read_csv(r"./datasets/credit_score.csv")
df
| CUST_ID | INCOME | SAVINGS | DEBT | R_SAVINGS_INCOME | R_DEBT_INCOME | R_DEBT_SAVINGS | T_CLOTHING_12 | T_CLOTHING_6 | R_CLOTHING | ... | R_EXPENDITURE_SAVINGS | R_EXPENDITURE_DEBT | CAT_GAMBLING | CAT_DEBT | CAT_CREDIT_CARD | CAT_MORTGAGE | CAT_SAVINGS_ACCOUNT | CAT_DEPENDENTS | CREDIT_SCORE | DEFAULT | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C02COQEVYU | 33269 | 0 | 532304 | 0.0000 | 16.0000 | 1.2000 | 1889 | 945 | 0.5003 | ... | 0.0000 | 0.0625 | High | 1 | 0 | 0 | 0 | 0 | 444 | 1 |
| 1 | C02OZKC0ZF | 77158 | 91187 | 315648 | 1.1818 | 4.0909 | 3.4615 | 5818 | 111 | 0.0191 | ... | 0.7692 | 0.2222 | No | 1 | 0 | 0 | 1 | 0 | 625 | 0 |
| 2 | C03FHP2D0A | 30917 | 21642 | 534864 | 0.7000 | 17.3000 | 24.7142 | 1157 | 860 | 0.7433 | ... | 1.4286 | 0.0578 | High | 1 | 0 | 0 | 1 | 0 | 469 | 1 |
| 3 | C03PVPPHOY | 80657 | 64526 | 629125 | 0.8000 | 7.8000 | 9.7499 | 6857 | 3686 | 0.5376 | ... | 1.2500 | 0.1282 | High | 1 | 0 | 0 | 1 | 0 | 559 | 0 |
| 4 | C04J69MUX0 | 149971 | 1172498 | 2399531 | 7.8182 | 16.0000 | 2.0465 | 1978 | 322 | 0.1628 | ... | 0.1163 | 0.0568 | High | 1 | 1 | 1 | 1 | 1 | 473 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | CZQHJC9HDH | 328892 | 1465066 | 5501471 | 4.4546 | 16.7273 | 3.7551 | 16701 | 10132 | 0.6067 | ... | 0.2041 | 0.0543 | High | 1 | 1 | 1 | 1 | 1 | 418 | 0 |
| 996 | CZRA4MLB0P | 81404 | 88805 | 680837 | 1.0909 | 8.3637 | 7.6667 | 5400 | 1936 | 0.3585 | ... | 0.8333 | 0.1087 | No | 1 | 0 | 0 | 1 | 0 | 589 | 1 |
| 997 | CZSOD1KVFX | 0 | 42428 | 30760 | 3.2379 | 8.1889 | 0.7250 | 0 | 0 | 0.8779 | ... | 0.2500 | 0.3448 | No | 1 | 0 | 0 | 1 | 0 | 499 | 0 |
| 998 | CZWC76UAUT | 36011 | 8002 | 604181 | 0.2222 | 16.7777 | 75.5037 | 1993 | 1271 | 0.6377 | ... | 5.0002 | 0.0662 | No | 1 | 1 | 0 | 1 | 0 | 507 | 0 |
| 999 | CZZV5B3SAL | 44266 | 309859 | 44266 | 6.9999 | 1.0000 | 0.1429 | 1574 | 1264 | 0.8030 | ... | 0.1587 | 1.1111 | No | 1 | 0 | 0 | 1 | 0 | 657 | 0 |
1000 rows × 87 columns
df.columns[1:5]
Index(['INCOME', 'SAVINGS', 'DEBT', 'R_SAVINGS_INCOME'], dtype='object')
print("Target columns in the dataset: ")
df[['DEFAULT', 'CREDIT_SCORE']]
Target columns in the dataset:
| DEFAULT | CREDIT_SCORE | |
|---|---|---|
| 0 | 1 | 444 |
| 1 | 0 | 625 |
| 2 | 1 | 469 |
| 3 | 0 | 559 |
| 4 | 0 | 473 |
| ... | ... | ... |
| 995 | 0 | 418 |
| 996 | 1 | 589 |
| 997 | 0 | 499 |
| 998 | 0 | 507 |
| 999 | 0 | 657 |
1000 rows × 2 columns
print("Dataset information: ")
df.info()
Dataset information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns: 87 total, all with 1000 non-null entries
dtypes: float64(51), int64(34), object(2)
memory usage: 679.8+ KB
# Check the data type of each column
print(df.dtypes)
CUST_ID object
INCOME int64
SAVINGS int64
DEBT int64
R_SAVINGS_INCOME float64
...
CAT_MORTGAGE int64
CAT_SAVINGS_ACCOUNT int64
CAT_DEPENDENTS int64
CREDIT_SCORE int64
DEFAULT int64
Length: 87, dtype: object
# Count the number of columns per dtype
print(df.dtypes.value_counts())
float64    51
int64      34
object      2
Name: count, dtype: int64
# Summary statistics
print("Summary Statistics:")
df.describe()
Summary Statistics:
| INCOME | SAVINGS | DEBT | R_SAVINGS_INCOME | R_DEBT_INCOME | R_DEBT_SAVINGS | T_CLOTHING_12 | T_CLOTHING_6 | R_CLOTHING | R_CLOTHING_INCOME | ... | R_EXPENDITURE_INCOME | R_EXPENDITURE_SAVINGS | R_EXPENDITURE_DEBT | CAT_DEBT | CAT_CREDIT_CARD | CAT_MORTGAGE | CAT_SAVINGS_ACCOUNT | CAT_DEPENDENTS | CREDIT_SCORE | DEFAULT | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1000.000000 | 1.000000e+03 | 1.000000e+03 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | ... | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.00000 | 1000.000000 | 1000.000000 |
| mean | 121610.019000 | 4.131896e+05 | 7.907180e+05 | 4.063477 | 6.068449 | 5.867252 | 6822.401000 | 3466.320000 | 0.454848 | 0.055557 | ... | 0.943607 | 0.913340 | 0.605276 | 0.944000 | 0.236000 | 0.173000 | 0.993000 | 0.15000 | 586.712000 | 0.284000 |
| std | 113716.699591 | 4.429160e+05 | 9.817904e+05 | 3.968097 | 5.847878 | 16.788356 | 7486.225932 | 5118.942977 | 0.236036 | 0.037568 | ... | 0.168989 | 1.625278 | 1.299382 | 0.230037 | 0.424835 | 0.378437 | 0.083414 | 0.35725 | 63.413882 | 0.451162 |
| min | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.003400 | ... | 0.666700 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 300.000000 | 0.000000 |
| 25% | 30450.250000 | 5.971975e+04 | 5.396675e+04 | 1.000000 | 1.454500 | 0.206200 | 1084.500000 | 319.500000 | 0.263950 | 0.029700 | ... | 0.833300 | 0.158700 | 0.100000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.00000 | 554.750000 | 0.000000 |
| 50% | 85090.000000 | 2.738505e+05 | 3.950955e+05 | 2.545450 | 4.911550 | 2.000000 | 4494.000000 | 1304.000000 | 0.468850 | 0.046800 | ... | 0.909100 | 0.327950 | 0.178600 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.00000 | 596.000000 | 0.000000 |
| 75% | 181217.500000 | 6.222600e+05 | 1.193230e+06 | 6.307100 | 8.587475 | 4.509600 | 10148.500000 | 4555.500000 | 0.626300 | 0.069400 | ... | 1.000000 | 0.833300 | 0.588200 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.00000 | 630.000000 | 1.000000 |
| max | 662094.000000 | 2.911863e+06 | 5.968620e+06 | 16.111200 | 37.000600 | 292.842100 | 43255.000000 | 39918.000000 | 1.058300 | 0.251700 | ... | 2.000200 | 10.009900 | 10.005300 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.00000 | 800.000000 | 1.000000 |
8 rows × 85 columns
X = df["CREDIT_SCORE"].dropna().astype(float).values
n = len(X)
## MLE Estimators
mu_mle = X.mean()
sigma2_mle = ((X-mu_mle)**2).mean()
sigma2_unbiased = np.var(X, ddof=1)
print("=== MLE Estimation for CREDIT_SCORE ===")
print(f"n = {n}")
print(f"MLE mean (μ̂_MLE): {mu_mle:.4f}")
print(f"MLE variance (σ̂²_MLE, biased): {sigma2_mle:.4f}")
print(f"Unbiased variance (s²): {sigma2_unbiased:.4f}")
=== MLE Estimation for CREDIT_SCORE ===
n = 1000
MLE mean (μ̂_MLE): 586.7120
MLE variance (σ̂²_MLE, biased): 4017.2991
Unbiased variance (s²): 4021.3204
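The biased MLE variance and the unbiased sample variance differ only by the deterministic factor $(n-1)/n$. A minimal sketch, using synthetic data (not the credit dataset) to verify the identity:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for CREDIT_SCORE (arbitrary parameters for illustration)
x = rng.normal(loc=587, scale=63, size=1000)
n = len(x)

sigma2_mle = ((x - x.mean()) ** 2).mean()  # biased MLE, divides by n
s2 = np.var(x, ddof=1)                     # unbiased estimator, divides by n - 1

# The two estimators are linked by an exact algebraic identity
assert np.isclose(sigma2_mle, (n - 1) / n * s2)
```

For n = 1000 the factor is 999/1000, which matches the small gap between 4017.30 and 4021.32 above.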
# significance level
alpha = 0.05
s = np.sqrt(sigma2_unbiased)
# CI for mu knowing sigma
z = norm.ppf(1 - alpha/2)
ci_mu_known = (
mu_mle - z * s / np.sqrt(n)
, mu_mle + z * s / np.sqrt(n)
)
# CI for mu with sigma unknown (t distribution)
t_val = t.ppf(1 - alpha/2, df=n-1)
ci_mu_unknown = (
mu_mle - t_val * s / np.sqrt(n)
, mu_mle + t_val * s / np.sqrt(n)
)
print("\n=== Confidence Intervals for μ ===")
print(f"95% CI (σ known): {ci_mu_known}")
print(f"95% CI (σ unknown): {ci_mu_unknown}")
=== Confidence Intervals for μ ===
95% CI (σ known): (582.7816391220821, 590.6423608779179)
95% CI (σ unknown): (582.7768715135624, 590.6471284864376)
chi2_low = chi2.ppf(alpha/2, df=n-1)
chi2_high = chi2.ppf(1 - alpha/2, df=n-1)
ci_sigma2 = (
(n-1) * sigma2_unbiased / chi2_high
, (n-1) * sigma2_unbiased / chi2_low
)
print("\n=== Confidence Interval for σ² (Chi-square) ===")
print(f"95% CI for variance σ²: {ci_sigma2}")
print(f"Approx CI for std deviation σ: ({np.sqrt(ci_sigma2[0]):.3f}, {np.sqrt(ci_sigma2[1]):.3f})")
=== Confidence Interval for σ² (Chi-square) ===
95% CI for variance σ²: (3690.7182226729915, 4398.658343162247)
Approx CI for std deviation σ: (60.751, 66.322)
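As a sanity check on the chi-square interval formula, a quick simulation on synthetic normal data (sample size, variance, and trial count are arbitrary choices of ours) should give empirical coverage near 95%:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(42)
n, true_var, alpha = 50, 4.0, 0.05
lo_q = chi2.ppf(alpha / 2, df=n - 1)
hi_q = chi2.ppf(1 - alpha / 2, df=n - 1)

hits, trials = 0, 2000
for _ in range(trials):
    x = rng.normal(0.0, np.sqrt(true_var), size=n)
    s2 = np.var(x, ddof=1)
    # Same interval construction as in the cell above
    lo, hi = (n - 1) * s2 / hi_q, (n - 1) * s2 / lo_q
    hits += lo <= true_var <= hi
coverage = hits / trials

# For exactly normal data, coverage should be close to 0.95
assert 0.92 < coverage < 0.98
```

Note this interval is exact only under normality; since CREDIT_SCORE fails the normality test later, the interval should be read as approximate.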
# Test H0: mu=600 for credit score
mu0 = 600
T_stat = (mu_mle - mu0) / (s / np.sqrt(n))
p_val_mean = 2 * (1 - t.cdf(abs(T_stat), df=n-1))
print("\n=== Hypothesis Test: Mean of CREDIT_SCORE ===")
print(f"H0: μ = {mu0}")
print(f"T statistic = {T_stat:.4f}")
print(f"p-value = {p_val_mean:.4f}")
=== Hypothesis Test: Mean of CREDIT_SCORE ===
H0: μ = 600
T statistic = -6.6264
p-value = 0.0000
A one-sample t-test was used to assess whether the mean CREDIT_SCORE equals 600. The result (T = −6.626, p < 0.001) leads us to reject the null hypothesis: the population mean is significantly lower than 600.
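The manual statistic above can be cross-checked against scipy.stats.ttest_1samp. A sketch on synthetic data standing in for CREDIT_SCORE (parameters are ours, chosen to mimic the sample above):

```python
import numpy as np
from scipy.stats import ttest_1samp, t

rng = np.random.default_rng(1)
x = rng.normal(loc=587, scale=63, size=1000)  # synthetic stand-in
mu0 = 600

# Manual statistic, as in the cell above
n = len(x)
T = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))
p = 2 * (1 - t.cdf(abs(T), df=n - 1))

# scipy computes the same two-sided test
res = ttest_1samp(x, popmean=mu0)
assert np.isclose(T, res.statistic)
assert np.isclose(p, res.pvalue)
```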
# Variance test
sigma0 = 60**2
V_stat = (n-1) * sigma2_unbiased / sigma0
p_val_var = 2 * min(chi2.cdf(V_stat, df=n-1),
1 - chi2.cdf(V_stat, df=n-1))
print("\n=== Hypothesis Test: Variance of CREDIT_SCORE ===")
print(f"H0: σ² = {sigma0}")
print(f"Chi-square statistic = {V_stat:.4f}")
print(f"p-value = {p_val_var:.4f}")
=== Hypothesis Test: Variance of CREDIT_SCORE ===
H0: σ² = 3600
Chi-square statistic = 1115.9164
p-value = 0.0113
Since p ≈ 0.011 < 0.05, we reject H0: the variance of CREDIT_SCORE differs significantly from 60² = 3600.
## CAT_GAMBLING VS DEFAULT
contingency = pd.crosstab(df["CAT_GAMBLING"], df["DEFAULT"])
chi2_stat, p_ind, dof, exp = chi2_contingency(contingency)
print("\n=== Independence Test: CAT_GAMBLING vs DEFAULT ===")
print(f"Chi2 statistic = {chi2_stat:.4f}")
print(f"p-value = {p_ind:.4f}")
print(f"Degrees of freedom = {dof}")
=== Independence Test: CAT_GAMBLING vs DEFAULT ===
Chi2 statistic = 2.6778
p-value = 0.2621
Degrees of freedom = 2
With p = 0.262 > 0.05, we fail to reject independence: gambling category does not appear to be associated with whether a customer defaults, based on this dataset.
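For intuition, chi2_contingency can be run on a small hypothetical 3×2 table (these counts are invented for illustration, not the real crosstab); the degrees of freedom are $(r-1)(c-1) = 2$, matching the result above:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: gambling level (rows: No, Low, High) vs DEFAULT (cols: 0, 1)
table = np.array([
    [400, 220],
    [ 70,  46],
    [160, 104],
])
chi2_stat, p, dof, expected = chi2_contingency(table)

assert dof == 2                         # (3 - 1) * (2 - 1)
assert expected.shape == table.shape    # expected counts under independence
```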
X_std = (X - mu_mle) / s
ks_stat, ks_p = kstest(X_std, 'norm')
print("\n=== Kolmogorov–Smirnov Normality Test (CREDIT_SCORE) ===")
print(f"KS statistic = {ks_stat:.4f}")
print(f"p-value = {ks_p:.4f}")
# QQ-plot
plt.figure(figsize=(6,6))
stats.probplot(X, dist="norm", plot=plt)
plt.title("QQ Plot for CREDIT_SCORE")
plt.show()
=== Kolmogorov–Smirnov Normality Test (CREDIT_SCORE) ===
KS statistic = 0.0659
p-value = 0.0003
The Kolmogorov–Smirnov test (p = 0.0003) rejects the null hypothesis of normality for CREDIT_SCORE. The QQ-plot also shows systematic deviations from the theoretical Gaussian line, with heavy tails and curvature. Therefore, CREDIT_SCORE cannot be considered normally distributed. This motivates transforming the distribution, for example with a log, Box-Cox, or Yeo-Johnson transform.
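A minimal sketch of why such transforms help, using synthetic lognormal data (for a lognormal sample the log is exactly normal, so this is a best-case illustration): both the skewness and the KS distance to the normal shrink after the transform.

```python
import numpy as np
from scipy.stats import kstest, skew

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=0.8, size=1000)  # right-skewed synthetic data

def ks_against_normal(v):
    # Standardize, then measure KS distance to the standard normal
    z = (v - v.mean()) / v.std(ddof=1)
    return kstest(z, "norm").statistic

d_raw = ks_against_normal(x)
d_log = ks_against_normal(np.log(x))  # log of a lognormal is exactly normal

assert skew(np.log(x)) < skew(x)  # skewness is reduced
assert d_log < d_raw              # distribution is closer to normal
```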
Data preprocessing is a crucial step to ensure the quality of the data and to produce the best possible results. We cleaned and prepared the dataset so that all features could be properly used for different data analysis techniques and machine learning models.
# Replace infinities, fill missing values, and drop the identifier
df = df.replace([np.inf, -np.inf], np.nan)
df = df.fillna(0)
df = df.drop(columns=["CUST_ID"])
The dataset also contained two categorical variables: CUST_ID, which was removed because it serves only as an identifier, and CAT_GAMBLING, which was encoded by mapping the categories "No", "Low", and "High" to 0, 1, and 2 respectively.
We do not remove outliers here, because credit score datasets naturally contain extreme behaviors (high debt, zero savings); these are meaningful and should remain in the model.
N_A = df.isna().sum().sort_values(ascending=False)
N_A, N_A.sum()
(INCOME 0
SAVINGS 0
DEBT 0
R_SAVINGS_INCOME 0
R_DEBT_INCOME 0
..
CAT_MORTGAGE 0
CAT_SAVINGS_ACCOUNT 0
CAT_DEPENDENTS 0
CREDIT_SCORE 0
DEFAULT 0
Length: 86, dtype: int64,
np.int64(0))
# duplicate
duplicate_rows = df[df.duplicated(keep=False)]
duplicate_row = df.duplicated().sum()
print("Sum total duplicates = ", duplicate_row)
Sum total duplicates = 0
After these checks, we can observe that the dataset contains no missing values and no duplicate rows.
The CAT_GAMBLING column is encoded into numerical values by mapping the categories "No", "Low", and "High" to 0, 1, and 2 respectively.
cat_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
print("Categorical columns:", cat_cols)
Categorical columns: ['CAT_GAMBLING']
df["CAT_GAMBLING"].value_counts()
CAT_GAMBLING
No      620
High    264
Low     116
Name: count, dtype: int64
gambling = {'High':2,'Low':1,'No':0}
df["CAT_GAMBLING"] = df["CAT_GAMBLING"].map(gambling)
df.head()
| INCOME | SAVINGS | DEBT | R_SAVINGS_INCOME | R_DEBT_INCOME | R_DEBT_SAVINGS | T_CLOTHING_12 | T_CLOTHING_6 | R_CLOTHING | R_CLOTHING_INCOME | ... | R_EXPENDITURE_SAVINGS | R_EXPENDITURE_DEBT | CAT_GAMBLING | CAT_DEBT | CAT_CREDIT_CARD | CAT_MORTGAGE | CAT_SAVINGS_ACCOUNT | CAT_DEPENDENTS | CREDIT_SCORE | DEFAULT | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 33269 | 0 | 532304 | 0.0000 | 16.0000 | 1.2000 | 1889 | 945 | 0.5003 | 0.0568 | ... | 0.0000 | 0.0625 | 2 | 1 | 0 | 0 | 0 | 0 | 444 | 1 |
| 1 | 77158 | 91187 | 315648 | 1.1818 | 4.0909 | 3.4615 | 5818 | 111 | 0.0191 | 0.0754 | ... | 0.7692 | 0.2222 | 0 | 1 | 0 | 0 | 1 | 0 | 625 | 0 |
| 2 | 30917 | 21642 | 534864 | 0.7000 | 17.3000 | 24.7142 | 1157 | 860 | 0.7433 | 0.0374 | ... | 1.4286 | 0.0578 | 2 | 1 | 0 | 0 | 1 | 0 | 469 | 1 |
| 3 | 80657 | 64526 | 629125 | 0.8000 | 7.8000 | 9.7499 | 6857 | 3686 | 0.5376 | 0.0850 | ... | 1.2500 | 0.1282 | 2 | 1 | 0 | 0 | 1 | 0 | 559 | 0 |
| 4 | 149971 | 1172498 | 2399531 | 7.8182 | 16.0000 | 2.0465 | 1978 | 322 | 0.1628 | 0.0132 | ... | 0.1163 | 0.0568 | 2 | 1 | 1 | 1 | 1 | 1 | 473 | 0 |
5 rows × 86 columns
variances = df.var().sort_values()
print(f"Variances = {variances.values}")
plt.figure(figsize=(10,4))
plt.plot(variances.values)
plt.title("Variance of Features")
plt.ylabel("Variance")
plt.show()
Variances = [1.33032933e-08 6.13660050e-07 1.93307207e-06 3.45459504e-06 3.64886558e-06 1.12808807e-05 2.97442591e-04 3.64860707e-04 4.53116219e-04 4.93765029e-04 6.66274890e-04 1.07865003e-03 1.36812647e-03 1.41138794e-03 1.96084766e-03 2.04296282e-03 2.73081684e-03 2.78940389e-03 3.27219146e-03 3.55618131e-03 3.91643733e-03 4.02426458e-03 5.48148731e-03 6.35846090e-03 6.46103765e-03 6.95795796e-03 7.13498630e-03 7.90829318e-03 9.54764292e-03 1.22973155e-02 1.87081946e-02 1.98485994e-02 2.00215421e-02 2.85571529e-02 3.21955423e-02 3.87907156e-02 4.51129519e-02 4.93700844e-02 5.15900335e-02 5.29169169e-02 5.57131097e-02 5.58459958e-02 5.62393284e-02 5.87347245e-02 8.18204286e-02 1.27627628e-01 1.43214214e-01 1.80484484e-01 2.03547548e-01 2.44957755e-01 3.33654442e-01 6.20261792e-01 7.58022022e-01 1.68839248e+00 2.64152938e+00 1.57457917e+01 3.41976728e+01 2.81848902e+02 4.02132038e+03 7.83387427e+03 1.85427467e+04 5.54892274e+06 6.60735188e+06 1.00664006e+07 1.14652552e+07 1.26127311e+07 2.15555279e+07 2.50716269e+07 2.62035772e+07 2.82683687e+07 3.98629189e+07 4.99241214e+07 5.43827157e+07 5.60435787e+07 8.19987580e+07 1.06364134e+08 1.53467194e+08 3.20138940e+08 3.68920740e+08 4.97543175e+08 1.28321569e+09 2.48541526e+09 7.96559696e+09 1.29314878e+10 1.96174616e+11 9.63912373e+11]
To better inspect anomalies, it helps to see how the data are distributed using box plot visualizations.
num_cols = df.select_dtypes(include=['int64', 'float64']).columns
cols = num_cols[:8]
fig, axes = plt.subplots(2, 4, figsize=(20, 10))
for idx, col in enumerate(cols):
row = idx // 4
col_pos = idx % 4
axes[row, col_pos].boxplot(df[col], vert=True, patch_artist=True,
boxprops=dict(facecolor='orange'))
axes[row, col_pos].set_title(f"Box Plot for {col}")
axes[row, col_pos].set_ylabel(col)
plt.tight_layout()
plt.show()
We decide not to remove the outliers, because they are often exactly the customers we want to detect: for example, people with extremely high debt or zero savings. These outliers represent high-risk cases.
We use a common detection method, the interquartile range (IQR) rule.
For each numerical variable $X$
$$ Q_{1} = \text{25th percentile} \qquad Q_{3} = \text{75th percentile} \qquad IQR = Q_{3} - Q_{1} $$An observation $x_{i}$ is classified as an outlier if $$ x_{i} < Q_{1} - 1.5\,IQR \qquad \text{or} \qquad x_{i} > Q_{3} + 1.5\,IQR$$
iqr_columns = [
"INCOME", "SAVINGS", "DEBT",
"R_SAVINGS_INCOME", "R_DEBT_INCOME",
"CREDIT_SCORE"
]
for col in iqr_columns:
Q1 = df[col].quantile(0.25)
Q3 = df[col].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
df[col] = df[col].clip(lower, upper)
# Reset the index (clipping keeps all rows; this simply guards against downstream indexing issues)
df = df.reset_index(drop=True)
print(f"Index reset. DataFrame shape: {df.shape}")
Index reset. DataFrame shape: (1000, 86)
fig, axes = plt.subplots(2, 4, figsize=(20, 10))
for idx, col in enumerate(iqr_columns):
r = idx // 4
c = idx % 4
axes[r, c].boxplot(df[col], patch_artist=True,
boxprops=dict(facecolor='orange'))
axes[r, c].set_title(f"{col} (After IQR Winsorization)")
plt.tight_layout()
plt.show()
We applied IQR-based outlier handling only to selected features (INCOME, SAVINGS, DEBT, CREDIT_SCORE, and moderate ratios). These variables have distributions where IQR is meaningful. We did not apply IQR to long-tailed or categorical variables to avoid removing informative extreme financial behaviors.
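The per-column clipping above can be wrapped in a small reusable helper. This is our own sketch (the function name is ours, not a library API):

```python
import pandas as pd

def iqr_winsorize(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip a numeric Series to [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(q1 - k * iqr, q3 + k * iqr)

# Toy example with one extreme value
s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
w = iqr_winsorize(s)
assert w.max() < 100.0         # the outlier is pulled in to the upper fence
assert (w[:4] == s[:4]).all()  # interior values are untouched
```

Unlike dropping rows, winsorization preserves the sample size, which matters here since extreme customers carry the risk signal.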
plots_per_page = 12
cols_per_row = 4
rows_per_page = 3
for i in range(0, len(num_cols), plots_per_page):
subset = num_cols[i:i+plots_per_page]
fig, axes = plt.subplots(rows_per_page, cols_per_row, figsize=(20, 12))
axes = axes.flatten()
for j, col in enumerate(subset):
axes[j].hist(df[col], bins=30, color='skyblue', edgecolor='black')
axes[j].set_title(col)
# Hide unused axes
for k in range(len(subset), plots_per_page):
axes[k].set_visible(False)
plt.tight_layout()
plt.show()
We group the columns before applying log transformations.
# Group 1 — Heavy-tailed monetary variables
money_cols = ["INCOME", "SAVINGS", "DEBT"]
# Group 2 — All T_* expenditure features
t_cols = [col for col in df.columns if col.startswith("T_")]
# Group 3 — Extreme ratio variables (_SAVINGS, _DEBT)
extreme_ratio_cols = [
col for col in df.columns
if col.endswith("_SAVINGS") or col.endswith("_DEBT")
]
# Group 4 — Moderately skewed _INCOME ratios
income_ratio_cols = [
col for col in df.columns
if col.endswith("_INCOME")
]
# Group 5 — Balanced ratio variables (no transform expected)
balanced_ratio_cols = [
col for col in df.columns
if col.startswith("R_")
and col not in extreme_ratio_cols
and col not in income_ratio_cols
]
# Group 6 — Binary categorical variables
binary_cols = [col for col in df.columns if col.startswith("CAT_")]
# Group 7 — Target variables
target_cols = ["CREDIT_SCORE", "DEFAULT"]
def plot_group(columns, title, df, cols_per_row=4):
if len(columns) == 0:
return
rows = (len(columns) + cols_per_row - 1) // cols_per_row
fig, axes = plt.subplots(rows, cols_per_row, figsize=(20, rows * 4))
axes = axes.flatten()
for i, col in enumerate(columns):
axes[i].hist(df[col], bins=30, color='skyblue', edgecolor='black')
axes[i].set_title(col)
# Hide empty axes
for j in range(len(columns), len(axes)):
axes[j].set_visible(False)
fig.suptitle(title, fontsize=20)
plt.tight_layout()
plt.show()
plot_group(money_cols, "Group 1 — Heavy-Tailed Monetary Variables\n", df)
plot_group(t_cols, "Group 2 — T_* Expenditure Variables\n", df)
plot_group(extreme_ratio_cols, "Group 3 — Extreme Ratio Variables (_SAVINGS & _DEBT)\n", df)
plot_group(income_ratio_cols, "Group 4 — Moderately Skewed Income Ratios\n", df)
plot_group(balanced_ratio_cols, "Group 5 — Balanced Ratio Variables\n", df)
plot_group(binary_cols, "Group 6 — Binary Variables\n", df)
plot_group(target_cols, "Group 7 — Target Variables\n", df)
# 1. Heavy monetary variables
must_transform = set(money_cols)
# 2. All T_* expenditure variables
must_transform.update(t_cols)
# 3. Extreme *_SAVINGS and *_DEBT ratios (exclude binary CAT_DEBT)
must_transform.update([c for c in extreme_ratio_cols if c != "CAT_DEBT"])
# 4. Strongly skewed income ratios (subset of income_ratio_cols)
strong_income_transform = [
"R_SAVINGS_INCOME"
, "R_DEBT_INCOME"
, "R_EDUCATION_INCOME"
, "R_ENTERTAINMENT_INCOME"
, "R_GAMBLING_INCOME"
, "R_HEALTH_INCOME"
, "R_TAX_INCOME"
, "R_TRAVEL_INCOME"
]
must_transform.update(strong_income_transform)
Does transformation really help PCA? Yes: PCA is driven by variances and covariances, so heavily skewed features with extreme values can dominate the principal components; making the distributions more symmetric yields more stable and interpretable components.
To reduce skewness, we apply transformations that make the distribution more symmetric.
Log1p transform: $$ x' = \log(1+x) $$
Yeo-Johnson transform: a generalized form of Box-Cox that also works for zero and negative values. It automatically selects a power parameter $\lambda$ to minimize skewness.
For $x \ge 0$: $$ T(x;\lambda) = \begin{cases} \dfrac{(x+1)^{\lambda} - 1}{\lambda}, & \lambda \ne 0, \\[8pt] \ln(x+1), & \lambda = 0. \end{cases} $$
For $x < 0$: $$ T(x;\lambda) = \begin{cases} -\dfrac{(-x+1)^{\, 2-\lambda} - 1}{\, 2-\lambda}, & \lambda \ne 2, \\[8pt] -\ln(-x+1), & \lambda = 2. \end{cases} $$
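The piecewise definition can be coded directly. This is our own sketch of a helper (not the sklearn implementation), checked against the two simple cases $\lambda = 0$ (log1p) and $\lambda = 1$ (identity) on the non-negative part:

```python
import numpy as np

def yeo_johnson(x, lam):
    """Yeo-Johnson transform of an array, following the piecewise definition above."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    if lam != 0:
        out[pos] = ((x[pos] + 1) ** lam - 1) / lam
    else:
        out[pos] = np.log1p(x[pos])
    if lam != 2:
        out[~pos] = -(((-x[~pos] + 1) ** (2 - lam) - 1) / (2 - lam))
    else:
        out[~pos] = -np.log1p(-x[~pos])
    return out

x = np.array([-2.0, 0.0, 1.0, 5.0])
# lam = 0 reduces to log1p on the non-negative part
assert np.allclose(yeo_johnson(x, 0.0)[1:], np.log1p(x[1:]))
# lam = 1 is the identity on the non-negative part
assert np.allclose(yeo_johnson(x, 1.0)[1:], x[1:])
```

In practice sklearn's PowerTransformer (used below) estimates λ by maximum likelihood and can also standardize the output.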
Both transformations are applied only to variables where:
$$ | \mathrm{skew}(X) | > 0.75 $$
df_original = df.copy()
num_cols = df.select_dtypes(include=['float64', 'int64']).columns
skip_cols = ["CREDIT_SCORE", "DEFAULT"] + [col for col in df.columns if col.startswith("CAT_")]
skew_threshold = 0.75
transform_cols = [
col for col in num_cols
if col not in skip_cols and abs(skew(df[col])) > skew_threshold
]
df_log = df_original.copy()
df_log[transform_cols] = np.log1p(df_original[transform_cols])
def compare_hist_log(col):
fig, axes = plt.subplots(1, 2, figsize=(12,4))
axes[0].hist(df_original[col], bins=40, color='skyblue', edgecolor='black')
axes[0].set_title(f"{col} — BEFORE (Original)")
axes[1].hist(df_log[col], bins=40, color='salmon', edgecolor='black')
axes[1].set_title(f"{col} — AFTER log1p")
plt.tight_layout()
plt.show()
pt = PowerTransformer(method="yeo-johnson")
df_yeo = df_original.copy()
df_yeo[transform_cols] = pt.fit_transform(df_original[transform_cols])
def compare_hist_yeojohnson(col):
fig, axes = plt.subplots(1, 2, figsize=(12,4))
axes[0].hist(df_original[col], bins=40, color='skyblue', edgecolor='black')
axes[0].set_title(f"{col} — BEFORE (Original)")
axes[1].hist(df_yeo[col], bins=40, color='salmon', edgecolor='black')
axes[1].set_title(f"{col} — AFTER (Yeo-Johnson)")
plt.tight_layout()
plt.show()
for col in transform_cols[:6]:
compare_hist_log(col)
compare_hist_yeojohnson(col)
df = df_yeo.copy()
The transformed features now have more balanced distributions. We can return later and compare results between the original and transformed versions.
We perform correlation analysis before scaling.
pearson_corr = df.corr(method='pearson')
plt.figure(figsize=(16, 12))
sns.heatmap(pearson_corr, cmap='coolwarm', center=0)
plt.title("Pearson Correlation Heatmap", fontsize=20)
plt.show()
# Zoom in on the first 5 columns (create the figure before drawing the heatmap)
plt.figure(figsize=(6, 4))
sns.heatmap(pearson_corr.iloc[:5, :5], annot=True)
plt.show()
Check highly correlated pairs
# We take 85% threshold
threshold = 0.85
high_corr_pairs = (
pearson_corr
.where(np.triu(np.ones(pearson_corr.shape), k=1).astype(bool))
.stack()
.reset_index()
)
high_corr_pairs.columns = ['Feature_1', 'Feature_2', 'Correlation']
# Filter by threshold and remove old index numbers
high_corr_pairs = (
high_corr_pairs[high_corr_pairs['Correlation'].abs() >= threshold]
.sort_values(by="Correlation", ascending=False)
.reset_index(drop=True)
)
print(high_corr_pairs[1:5])
        Feature_1      Feature_2  Correlation
1  T_EDUCATION_12  T_EDUCATION_6     0.999965
2    T_HOUSING_12    T_HOUSING_6     0.999961
3   T_GAMBLING_12   T_GAMBLING_6     0.999922
4        T_TAX_12        T_TAX_6     0.999163
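One common follow-up, sketched here on a toy frame (this is an illustration, not necessarily what the pipeline does later), is to drop one feature from each pair above the threshold, since near-duplicates like the _12/_6 pairs add little information:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "T_X_12": a,
    "T_X_6": a + rng.normal(scale=0.01, size=200),  # near-duplicate, like the _12/_6 pairs
    "OTHER": rng.normal(size=200),
})

corr = toy.corr().abs()
# Upper triangle only, so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [c for c in upper.columns if (upper[c] >= 0.85).any()]
reduced = toy.drop(columns=to_drop)

assert to_drop == ["T_X_6"]
assert "OTHER" in reduced.columns
```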
The Exploratory Data Analysis (EDA) in this project was conducted on the credit score dataset. The primary goal was to understand the distribution of the target variables CREDIT_SCORE and DEFAULT and identify key features that correlate most strongly with customer default risk.
object_cols = df.select_dtypes(include=['object']).columns.tolist()
number_object = len(object_cols)
numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns.tolist()
number_numerical = len(numeric_cols)
print(f"\nThe total objects columns in dataset is: {object_cols} \n==> equal {number_object}")
print(f"The total numerical columns in dataset is: {numeric_cols} \n==> equal {number_numerical}")
The total objects columns in dataset is: [] ==> equal 0 The total numerical columns in dataset is: ['INCOME', 'SAVINGS', 'DEBT', 'R_SAVINGS_INCOME', 'R_DEBT_INCOME', 'R_DEBT_SAVINGS', 'T_CLOTHING_12', 'T_CLOTHING_6', 'R_CLOTHING', 'R_CLOTHING_INCOME', 'R_CLOTHING_SAVINGS', 'R_CLOTHING_DEBT', 'T_EDUCATION_12', 'T_EDUCATION_6', 'R_EDUCATION', 'R_EDUCATION_INCOME', 'R_EDUCATION_SAVINGS', 'R_EDUCATION_DEBT', 'T_ENTERTAINMENT_12', 'T_ENTERTAINMENT_6', 'R_ENTERTAINMENT', 'R_ENTERTAINMENT_INCOME', 'R_ENTERTAINMENT_SAVINGS', 'R_ENTERTAINMENT_DEBT', 'T_FINES_12', 'T_FINES_6', 'R_FINES', 'R_FINES_INCOME', 'R_FINES_SAVINGS', 'R_FINES_DEBT', 'T_GAMBLING_12', 'T_GAMBLING_6', 'R_GAMBLING', 'R_GAMBLING_INCOME', 'R_GAMBLING_SAVINGS', 'R_GAMBLING_DEBT', 'T_GROCERIES_12', 'T_GROCERIES_6', 'R_GROCERIES', 'R_GROCERIES_INCOME', 'R_GROCERIES_SAVINGS', 'R_GROCERIES_DEBT', 'T_HEALTH_12', 'T_HEALTH_6', 'R_HEALTH', 'R_HEALTH_INCOME', 'R_HEALTH_SAVINGS', 'R_HEALTH_DEBT', 'T_HOUSING_12', 'T_HOUSING_6', 'R_HOUSING', 'R_HOUSING_INCOME', 'R_HOUSING_SAVINGS', 'R_HOUSING_DEBT', 'T_TAX_12', 'T_TAX_6', 'R_TAX', 'R_TAX_INCOME', 'R_TAX_SAVINGS', 'R_TAX_DEBT', 'T_TRAVEL_12', 'T_TRAVEL_6', 'R_TRAVEL', 'R_TRAVEL_INCOME', 'R_TRAVEL_SAVINGS', 'R_TRAVEL_DEBT', 'T_UTILITIES_12', 'T_UTILITIES_6', 'R_UTILITIES', 'R_UTILITIES_INCOME', 'R_UTILITIES_SAVINGS', 'R_UTILITIES_DEBT', 'T_EXPENDITURE_12', 'T_EXPENDITURE_6', 'R_EXPENDITURE', 'R_EXPENDITURE_INCOME', 'R_EXPENDITURE_SAVINGS', 'R_EXPENDITURE_DEBT', 'CAT_GAMBLING', 'CAT_DEBT', 'CAT_CREDIT_CARD', 'CAT_MORTGAGE', 'CAT_SAVINGS_ACCOUNT', 'CAT_DEPENDENTS', 'CREDIT_SCORE', 'DEFAULT'] ==> equal 86
# Distribution of CREDIT_SCORE
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# Histogram with KDE
sns.histplot(df['CREDIT_SCORE'], kde=True, ax=axes[0], color='steelblue')
axes[0].set_title('Distribution of CREDIT_SCORE')
axes[0].axvline(df['CREDIT_SCORE'].mean(), color='red', linestyle='--', label=f"Mean: {df['CREDIT_SCORE'].mean():.1f}")
axes[0].axvline(df['CREDIT_SCORE'].median(), color='green', linestyle='--', label=f"Median: {df['CREDIT_SCORE'].median():.1f}")
axes[0].legend()
# Box plot
sns.boxplot(y=df['CREDIT_SCORE'], ax=axes[1], color='steelblue')
axes[1].set_title('Box Plot of CREDIT_SCORE')
# CREDIT_SCORE by DEFAULT
sns.boxplot(x='DEFAULT', y='CREDIT_SCORE', hue='DEFAULT', data=df, ax=axes[2], palette='Set2', legend=False)
axes[2].set_title('CREDIT_SCORE by DEFAULT Status')
axes[2].set_xticks([0, 1])
axes[2].set_xticklabels(['No Default (0)', 'Default (1)'])
plt.tight_layout()
plt.show()
# Statistics
print("CREDIT_SCORE Statistics:")
print(f" Skewness: {df['CREDIT_SCORE'].skew():.3f}")
print(f" Kurtosis: {df['CREDIT_SCORE'].kurtosis():.3f}")
print(f"\nCREDIT_SCORE by DEFAULT:")
print(df.groupby('DEFAULT')['CREDIT_SCORE'].describe())
CREDIT_SCORE Statistics:
Skewness: -0.531
Kurtosis: 0.121
CREDIT_SCORE by DEFAULT:
count mean std min 25% 50% 75% max
DEFAULT
0 716.0 600.312675 51.735251 441.875 566.00 606.5 635.25 742.875
1 284.0 558.238996 62.273254 441.875 516.75 565.5 609.00 691.000
The Credit Score was segmented into four categories (Poor, Fair, Good, Excellent) to analyze the default rate across risk levels.
| Credit_Score | Category |
|---|---|
| score < 500 | Poor |
| 500 <= score < 600 | Fair |
| 600 <= score < 700 | Good |
| score >= 700 | Excellent |
# Credit Score Segmentation
def credit_score_category(score):
if score < 500:
return 'Poor'
elif score < 600:
return 'Fair'
elif score < 700:
return 'Good'
else:
return 'Excellent'
df['CREDIT_CATEGORY'] = df['CREDIT_SCORE'].apply(credit_score_category)
# Analysis by category
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Count by category
category_order = ['Poor', 'Fair', 'Good', 'Excellent']
sns.countplot(x='CREDIT_CATEGORY', hue='CREDIT_CATEGORY', data=df, order=category_order, ax=axes[0], palette='RdYlGn', legend=False)
axes[0].set_title('Distribution by Credit Score Category')
axes[0].set_xlabel('Credit Category')
# Default rate by category
default_rate = df.groupby('CREDIT_CATEGORY')['DEFAULT'].mean().reindex(category_order)
default_rate.plot(kind='bar', ax=axes[1], color=['red', 'orange', 'yellowgreen', 'green'])
axes[1].set_title('Default Rate by Credit Score Category')
axes[1].set_ylabel('Default Rate')
axes[1].set_xticklabels(category_order, rotation=0)
plt.tight_layout()
plt.show()
# Cross-tabulation
print("Cross-tabulation: CREDIT_CATEGORY vs DEFAULT")
print(pd.crosstab(df['CREDIT_CATEGORY'], df['DEFAULT'], margins=True, normalize='index').round(3))
Cross-tabulation: CREDIT_CATEGORY vs DEFAULT (row-normalized)

| CREDIT_CATEGORY | DEFAULT = 0 | DEFAULT = 1 |
|---|---|---|
| Excellent | 1.000 | 0.000 |
| Fair | 0.677 | 0.323 |
| Good | 0.814 | 0.186 |
| Poor | 0.333 | 0.667 |
| All | 0.716 | 0.284 |
# Top correlated features with CREDIT_SCORE
credit_corr = df[numeric_cols].corr()['CREDIT_SCORE'].drop('CREDIT_SCORE').sort_values(key=abs, ascending=False)
print("Top 15 features correlated with CREDIT_SCORE:")
print(credit_corr.head(15))
print("\n" + "="*50)
# Top correlated features with DEFAULT
default_corr = df[numeric_cols].corr()['DEFAULT'].drop('DEFAULT').sort_values(key=abs, ascending=False)
print("\nTop 15 features correlated with DEFAULT:")
print(default_corr.head(15))
Top 15 features correlated with CREDIT_SCORE:
R_DEBT_INCOME            -0.781501
R_EXPENDITURE_DEBT        0.617799
R_DEBT_SAVINGS           -0.606353
R_TAX_DEBT                0.597723
R_UTILITIES_DEBT          0.569899
R_ENTERTAINMENT_DEBT      0.560518
R_CLOTHING_DEBT           0.527886
R_GROCERIES_DEBT          0.515530
R_HEALTH_DEBT             0.500986
R_TRAVEL_DEBT             0.416605
DEBT                     -0.412932
R_HOUSING_DEBT            0.350996
R_UTILITIES_SAVINGS      -0.344134
DEFAULT                  -0.326766
R_HEALTH_SAVINGS         -0.268628
Name: CREDIT_SCORE, dtype: float64

Top 15 features correlated with DEFAULT:
CREDIT_SCORE             -0.326766
R_DEBT_INCOME             0.253076
R_TAX_DEBT               -0.246836
R_EXPENDITURE_DEBT       -0.208215
R_ENTERTAINMENT_DEBT     -0.199903
R_CLOTHING_DEBT          -0.189810
R_DEBT_SAVINGS            0.187416
R_HEALTH_DEBT            -0.177106
R_UTILITIES_DEBT         -0.174043
R_GROCERIES_DEBT         -0.157802
DEBT                      0.125168
R_TRAVEL_DEBT            -0.124004
CAT_CREDIT_CARD           0.119993
R_GROCERIES_SAVINGS       0.103225
R_UTILITIES_SAVINGS       0.095051
Name: DEFAULT, dtype: float64
# Visualize top correlated features
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
# Top 10 for CREDIT_SCORE
top_credit = credit_corr.head(10)
colors = ['green' if x > 0 else 'red' for x in top_credit.values]
top_credit.plot(kind='barh', ax=axes[0], color=colors)
axes[0].set_title('Top 10 Features Correlated with CREDIT_SCORE')
axes[0].set_xlabel('Correlation')
# Top 10 for DEFAULT
top_default = default_corr.head(10)
colors = ['green' if x > 0 else 'red' for x in top_default.values]
top_default.plot(kind='barh', ax=axes[1], color=colors)
axes[1].set_title('Top 10 Features Correlated with DEFAULT')
axes[1].set_xlabel('Correlation')
plt.tight_layout()
plt.show()
Top Features Correlated with CREDIT_SCORE: The strongest relationships are with R_DEBT_INCOME (negative, -0.78) and R_EXPENDITURE_DEBT (positive, 0.62). The most correlated features are ratio variables that directly measure a customer's financial stability and leverage: high savings and low debt are consistently linked to a higher CREDIT_SCORE, while a high debt-to-income ratio lowers it.
Top Features Correlated with DEFAULT: The features correlated with DEFAULT largely mirror those for CREDIT_SCORE with inverted signs, as expected, since a high score implies low default risk. Aside from CREDIT_SCORE itself (-0.33), the strongest correlations are with R_DEBT_INCOME (positive, 0.25) and R_TAX_DEBT (negative, -0.25): high debt relative to income increases the likelihood of default. In short, the features that strongly predict a higher CREDIT_SCORE are inverse predictors of DEFAULT.
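Because DEFAULT is a 0/1 variable, the Pearson correlations reported against it are point-biserial correlations. `scipy.stats.pointbiserialr` computes the same coefficient and additionally returns a p-value; a minimal sketch on synthetic stand-in data (the real columns would be `df['R_DEBT_INCOME']` and `df['DEFAULT']`):

```python
import numpy as np
from scipy.stats import pointbiserialr

rng = np.random.default_rng(0)
# Synthetic stand-ins: a continuous debt ratio and a binary default flag
# that depends on it (hypothetical data, not the Kaggle dataset).
debt_ratio = rng.normal(1.0, 0.5, size=1000)
default = (debt_ratio + rng.normal(0, 0.5, size=1000) > 1.3).astype(int)

r, p = pointbiserialr(default, debt_ratio)
print(f"point-biserial r = {r:.3f}, p = {p:.3g}")
```

With one binary argument, the result coincides with the plain Pearson correlation, so the signs and magnitudes reported above can be read as point-biserial coefficients directly.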
top_features = credit_corr.head(6).index.tolist()
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.flatten()
for i, feature in enumerate(top_features):
axes[i].scatter(df[feature], df['CREDIT_SCORE'], alpha=0.5, c=df['DEFAULT'], cmap='coolwarm')
axes[i].set_xlabel(feature)
axes[i].set_ylabel('CREDIT_SCORE')
axes[i].set_title(f'{feature} vs CREDIT_SCORE (r={credit_corr[feature]:.3f})')
plt.tight_layout()
plt.show()
top_default_features = default_corr.head(6).index.tolist()
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.flatten()
for i, feature in enumerate(top_default_features):
sns.boxplot(x='DEFAULT', y=feature, hue='DEFAULT', data=df, ax=axes[i], palette='Set2', legend=False)
axes[i].set_title(f'{feature} by DEFAULT (r={default_corr[feature]:.3f})')
axes[i].set_xticks([0, 1])
axes[i].set_xticklabels(['No Default', 'Default'])
plt.tight_layout()
plt.show()
Six box plots were generated to visualize the relationship between the top-correlated numerical features and the DEFAULT status.
R_DEBT_INCOME:
The defaulter group (1) has a clearly higher median ratio than the non-defaulter group (0), confirming the positive correlation (r = 0.253). This suggests that customers with high debt relative to their income are more likely to default.
R_DEBT_SAVINGS:
Similar to R_DEBT_INCOME, the median R_DEBT_SAVINGS is significantly higher for defaulters, highlighting that having much more debt than savings is a strong indicator of risk.
R_EXPENDITURE_DEBT:
This ratio shows a slightly lower median for defaulters, though the difference is less pronounced than for the income and savings ratios.
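The median gaps visible in the box plots can also be backed by a formal test; the Mann-Whitney U test makes no normality assumption, which suits these skewed ratios. A sketch on hypothetical skewed samples (the real groups would be `df.loc[df['DEFAULT'] == 0, 'R_DEBT_INCOME']` and its defaulter counterpart):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
# Hypothetical lognormal samples mimicking a debt ratio for non-defaulters
# (n=716) and defaulters (n=284), with a shifted median for defaulters.
no_default = rng.lognormal(mean=0.0, sigma=0.4, size=716)
defaulters = rng.lognormal(mean=0.3, sigma=0.4, size=284)

# A small p-value indicates the two groups' distributions differ.
u_stat, p_value = mannwhitneyu(no_default, defaulters, alternative='two-sided')
print(f"U = {u_stat:.0f}, p = {p_value:.3g}")
```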
# Analysis of binary categorical features vs DEFAULT
binary_cats = ['CAT_DEBT', 'CAT_CREDIT_CARD', 'CAT_MORTGAGE', 'CAT_SAVINGS_ACCOUNT', 'CAT_DEPENDENTS']
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.flatten()
for i, cat in enumerate(binary_cats):
# Default rate by category
default_rate = df.groupby(cat)['DEFAULT'].mean()
default_rate.plot(kind='bar', ax=axes[i], color=['steelblue', 'coral'])
axes[i].set_title(f'Default Rate by {cat}')
axes[i].set_ylabel('Default Rate')
axes[i].set_xticklabels(['No (0)', 'Yes (1)'], rotation=0)
# Add values on bars
for j, v in enumerate(default_rate.values):
axes[i].text(j, v + 0.01, f'{v:.2%}', ha='center')
axes[5].set_visible(False)
plt.tight_layout()
plt.show()
# Summary statistics
print("Default Rate by Binary Categorical Features:")
for cat in binary_cats:
rate = df.groupby(cat)['DEFAULT'].mean()
print(f"\n{cat}:")
print(f" No (0): {rate.get(0, 0):.2%}")
print(f" Yes (1): {rate.get(1, 0):.2%}")
Default Rate by Binary Categorical Features:

CAT_DEBT: No (0): 21.43%, Yes (1): 28.81%
CAT_CREDIT_CARD: No (0): 25.39%, Yes (1): 38.14%
CAT_MORTGAGE: No (0): 27.57%, Yes (1): 32.37%
CAT_SAVINGS_ACCOUNT: No (0): 42.86%, Yes (1): 28.30%
CAT_DEPENDENTS: No (0): 27.18%, Yes (1): 35.33%
The analysis included several binary categorical features (such as CAT_DEBT, CAT_MORTGAGE, etc.) to assess their impact on the default rate.
CAT_DEBT and CAT_CREDIT_CARD:
Having debt (CAT_DEBT = 1) increases the default rate (28.81% vs 21.43% without debt), and having a credit card (CAT_CREDIT_CARD = 1) corresponds to a noticeably higher default rate (38.14% vs 25.39%). This suggests that access to these financial products, while common, may increase exposure to financial risk.
CAT_SAVINGS_ACCOUNT:
Customers who do not have a savings account (CAT_SAVINGS_ACCOUNT = 0) exhibit a higher default rate (42.86%) compared to those who do (28.30%). This is a crucial finding, indicating that the presence of a savings buffer acts as a protective factor against default.
CAT_DEPENDENTS:
Having dependents (CAT_DEPENDENTS = 1) raises the default rate moderately compared to those without dependents (35.33% vs 27.18%), suggesting that this factor alone is not a strong predictor of default risk.
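Whether these rate differences are statistically significant can be checked with a chi-square test of independence (`chi2_contingency` is already imported at the top of the notebook). A sketch on an illustrative 2x2 table chosen to roughly match the CAT_SAVINGS_ACCOUNT rates above; the real table would come from `pd.crosstab(df['CAT_SAVINGS_ACCOUNT'], df['DEFAULT'])`:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Illustrative counts (rows: CAT_SAVINGS_ACCOUNT = 0/1, cols: DEFAULT = 0/1);
# these are made-up numbers consistent with the rates reported above.
table = np.array([[40, 30],
                  [666, 264]])

# p < 0.05 would suggest default status is not independent of savings-account ownership.
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3g}")
```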
# Find highly correlated feature pairs (potential redundancy)
corr_matrix = df[numeric_cols].corr().abs()
# Get upper triangle
upper_tri = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Find pairs with correlation > 0.9
high_corr_pairs = []
for col in upper_tri.columns:
for idx in upper_tri.index:
if upper_tri.loc[idx, col] > 0.9:
high_corr_pairs.append((idx, col, upper_tri.loc[idx, col]))
# Sort by correlation
high_corr_pairs.sort(key=lambda x: x[2], reverse=True)
print(f"Found {len(high_corr_pairs)} highly correlated feature pairs (r > 0.9):")
print("="*70)
for feat1, feat2, corr_val in high_corr_pairs[:20]: # Show top 20
print(f"{feat1} <-> {feat2}: {corr_val:.3f}")
Found 32 highly correlated feature pairs (r > 0.9):
======================================================================
T_UTILITIES_12 <-> T_UTILITIES_6: 1.000
T_EDUCATION_12 <-> T_EDUCATION_6: 1.000
T_HOUSING_12 <-> T_HOUSING_6: 1.000
T_GAMBLING_12 <-> T_GAMBLING_6: 1.000
T_TAX_12 <-> T_TAX_6: 0.999
T_GROCERIES_12 <-> T_GROCERIES_6: 0.996
T_ENTERTAINMENT_12 <-> T_ENTERTAINMENT_6: 0.994
T_EXPENDITURE_12 <-> T_EXPENDITURE_6: 0.989
INCOME <-> T_EXPENDITURE_12: 0.977
T_TRAVEL_12 <-> T_TRAVEL_6: 0.975
T_GAMBLING_6 <-> CAT_GAMBLING: 0.970
T_GAMBLING_12 <-> CAT_GAMBLING: 0.969
T_GROCERIES_12 <-> T_UTILITIES_6: 0.965
T_GROCERIES_12 <-> T_UTILITIES_12: 0.964
INCOME <-> T_EXPENDITURE_6: 0.963
T_GROCERIES_6 <-> T_UTILITIES_6: 0.953
T_GROCERIES_6 <-> T_UTILITIES_12: 0.952
T_TAX_6 <-> T_UTILITIES_6: 0.927
T_TAX_12 <-> T_UTILITIES_6: 0.927
T_TAX_6 <-> T_UTILITIES_12: 0.926
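An alternative way to handle this redundancy, before reaching for dimensionality reduction, is to simply prune one member of each highly correlated pair, reusing the upper-triangle trick above. A minimal sketch on a toy frame (the column names here are made up for illustration):

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(frame: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one column from each pair whose |r| exceeds the threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair is examined once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop)

# Toy example: 'b' is a near-duplicate of 'a', 'c' is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({'a': a,
                    'b': a * 2 + rng.normal(scale=0.01, size=200),
                    'c': rng.normal(size=200)})
pruned = drop_highly_correlated(toy)
print(list(pruned.columns))
```

Pruning keeps the original, interpretable features, whereas PCA (used below) replaces them with uncorrelated linear combinations.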
The original dataset contains a large number of numerical variables derived from financial indicators, spending categories, and ratio-based measures. Many of these variables are highly correlated, which can lead to multicollinearity issues and reduce model stability. To address this, Principal Component Analysis (PCA) was applied to reduce dimensionality, remove correlations between variables, and extract latent structures representative of financial behavior.
Before applying Principal Component Analysis (PCA), all numerical variables were robustly scaled in order to reduce the influence of extreme values and outliers, which are common in financial data. Unlike standard scaling based on the mean and standard deviation, robust scaling relies on the median and the interquartile range (IQR), making it more stable for skewed distributions.
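Concretely, RobustScaler maps each column x to (x - median) / IQR. A quick sketch verifying this against a manual computation on a skewed synthetic column (not the project data):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(42)
# A skewed column with a few extreme values, like many financial ratios.
col = np.concatenate([rng.lognormal(0, 0.5, 100), [50.0, 80.0]]).reshape(-1, 1)

scaled = RobustScaler().fit_transform(col)

# Manual equivalent: subtract the median, divide by the IQR (Q3 - Q1).
q1, q3 = np.percentile(col, [25, 75])
manual = (col - np.median(col)) / (q3 - q1)
print(bool(np.allclose(scaled, manual)))
```

Because the median and IQR are insensitive to the two injected outliers, the bulk of the column lands in a narrow range instead of being compressed by them, which is exactly why robust scaling was preferred here over standard scaling.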
from sklearn.preprocessing import RobustScaler
x = df.drop(columns=['CREDIT_SCORE', 'DEFAULT', 'CREDIT_CATEGORY'])
y_class = df['DEFAULT']
y_reg = df['CREDIT_SCORE']
y_seg = df['CREDIT_CATEGORY']
scaler = RobustScaler()
x_scaled = pd.DataFrame(scaler.fit_transform(x), columns=x.columns)
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(x_scaled)
# Cumulative variance
cum_var = np.cumsum(pca.explained_variance_ratio_)
# Choose threshold
threshold = 0.90 # 90% explained variance
n_components = np.argmax(cum_var >= threshold) + 1
plt.figure(figsize=(10,5))
plt.plot(cum_var, marker='o', label="Cumulative Explained Variance")
# Horizontal threshold
plt.axhline(threshold, color='red', linestyle='--', label=f"{int(threshold*100)}% threshold")
# Vertical line marking n_components
plt.axvline(n_components-1, color='green', linestyle='--')
plt.text(
n_components-1, threshold + 0.02
, f"n_components = {n_components}"
, color = 'green'
, fontsize = 12
)
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.title("PCA Scree Plot with Threshold")
plt.grid(True)
plt.legend()
plt.show()
print(f"Selected number of components: {n_components}")
Selected number of components: 18
Based on PCA scree plot: Based on the eigenvalue plot and the cumulative explained variance, a threshold of 90% explained variance was established. This threshold is reached when 18 principal components are retained. Beyond this point, additional components provide only marginal gains in terms of explained variance.
pca = PCA(n_components=n_components)
x_pca = pca.fit_transform(x_scaled)
pca.explained_variance_ratio_
array([0.16907334, 0.14337906, 0.10834535, 0.08341869, 0.07777103,
0.06228348, 0.04958545, 0.04526524, 0.0350692 , 0.03352146,
0.01733876, 0.01395499, 0.01284695, 0.01241365, 0.01101743,
0.01076421, 0.00934558, 0.0086326 ])
pca.explained_variance_ratio_.sum()
np.float64(0.904026468600927)
loadings = pd.DataFrame(
pca.components_.T
, columns = [f'PC{i+1}' for i in range(n_components)]
, index = x.columns
)
pc_tables = {}
for i in range(n_components):
pc_name = f'PC{i+1}'
top_vals = loadings[pc_name].abs().sort_values(ascending=False).head(10)
pc_tables[pc_name] = pd.DataFrame({
'Variable' : top_vals.index
, 'Loading Value' : top_vals.values
})
for pc, table in pc_tables.items():
print(f"\n==== {pc} ====")
display(table)
==== PC1 ====
| Variable | Loading Value | |
|---|---|---|
| 0 | R_GAMBLING_SAVINGS | 0.264770 |
| 1 | INCOME | 0.215508 |
| 2 | T_EXPENDITURE_6 | 0.208653 |
| 3 | T_CLOTHING_12 | 0.201097 |
| 4 | T_EXPENDITURE_12 | 0.200726 |
| 5 | T_TAX_12 | 0.200374 |
| 6 | T_TAX_6 | 0.200075 |
| 7 | R_EXPENDITURE_INCOME | 0.199031 |
| 8 | T_GROCERIES_6 | 0.196335 |
| 9 | T_UTILITIES_12 | 0.195952 |
==== PC2 ====
| Variable | Loading Value | |
|---|---|---|
| 0 | R_GAMBLING_SAVINGS | 0.621232 |
| 1 | R_EXPENDITURE_SAVINGS | 0.167549 |
| 2 | R_EDUCATION_SAVINGS | 0.159937 |
| 3 | SAVINGS | 0.156874 |
| 4 | R_GROCERIES_SAVINGS | 0.156814 |
| 5 | R_SAVINGS_INCOME | 0.155187 |
| 6 | R_GAMBLING_INCOME | 0.154184 |
| 7 | R_ENTERTAINMENT_SAVINGS | 0.150321 |
| 8 | R_HEALTH_INCOME | 0.146570 |
| 9 | R_HEALTH_DEBT | 0.142328 |
==== PC3 ====
| Variable | Loading Value | |
|---|---|---|
| 0 | R_TAX | 0.962009 |
| 1 | T_FINES_12 | 0.096223 |
| 2 | R_FINES_SAVINGS | 0.088900 |
| 3 | T_FINES_6 | 0.081871 |
| 4 | R_FINES_INCOME | 0.080637 |
| 5 | R_FINES_DEBT | 0.078642 |
| 6 | T_TRAVEL_12 | 0.055328 |
| 7 | T_TRAVEL_6 | 0.046539 |
| 8 | R_HEALTH_DEBT | 0.044465 |
| 9 | R_EXPENDITURE_INCOME | 0.038941 |
==== PC4 ====
| Variable | Loading Value | |
|---|---|---|
| 0 | R_GAMBLING_SAVINGS | 0.523514 |
| 1 | R_EDUCATION_SAVINGS | 0.209941 |
| 2 | R_GAMBLING_DEBT | 0.206583 |
| 3 | R_HEALTH_DEBT | 0.200184 |
| 4 | R_GAMBLING_INCOME | 0.192789 |
| 5 | R_UTILITIES_DEBT | 0.171324 |
| 6 | R_HEALTH_INCOME | 0.167269 |
| 7 | R_GROCERIES_SAVINGS | 0.163180 |
| 8 | R_GROCERIES_DEBT | 0.162360 |
| 9 | R_TRAVEL_SAVINGS | 0.154809 |
==== PC5 ====
| Variable | Loading Value | |
|---|---|---|
| 0 | T_FINES_12 | 0.458215 |
| 1 | R_FINES_INCOME | 0.438916 |
| 2 | T_FINES_6 | 0.412101 |
| 3 | R_FINES_SAVINGS | 0.395987 |
| 4 | R_FINES_DEBT | 0.395031 |
| 5 | R_GAMBLING_SAVINGS | 0.176785 |
| 6 | R_TAX | 0.173157 |
| 7 | R_UTILITIES_INCOME | 0.072223 |
| 8 | R_TRAVEL_INCOME | 0.064162 |
| 9 | R_EXPENDITURE | 0.047921 |
==== PC6 ====
| Variable | Loading Value | |
|---|---|---|
| 0 | R_TRAVEL_INCOME | 0.321089 |
| 1 | R_EXPENDITURE | 0.302117 |
| 2 | R_UTILITIES_INCOME | 0.277703 |
| 3 | T_TRAVEL_6 | 0.276244 |
| 4 | R_GROCERIES | 0.265586 |
| 5 | T_TRAVEL_12 | 0.258799 |
| 6 | R_TRAVEL | 0.242218 |
| 7 | R_TRAVEL_DEBT | 0.200495 |
| 8 | R_GROCERIES_INCOME | 0.194468 |
| 9 | R_CLOTHING_DEBT | 0.163115 |
==== PC7 ====
| Variable | Loading Value | |
|---|---|---|
| 0 | R_TAX_DEBT | 0.370426 |
| 1 | R_ENTERTAINMENT_INCOME | 0.260088 |
| 2 | R_CLOTHING_DEBT | 0.246320 |
| 3 | R_ENTERTAINMENT_DEBT | 0.225659 |
| 4 | T_ENTERTAINMENT_6 | 0.221338 |
| 5 | T_ENTERTAINMENT_12 | 0.214602 |
| 6 | R_TAX_SAVINGS | 0.191690 |
| 7 | R_ENTERTAINMENT_SAVINGS | 0.186203 |
| 8 | R_EXPENDITURE | 0.184222 |
| 9 | R_DEBT_INCOME | 0.180375 |
==== PC8 ====
| Variable | Loading Value | |
|---|---|---|
| 0 | R_EXPENDITURE | 0.337961 |
| 1 | R_GROCERIES | 0.321378 |
| 2 | T_TRAVEL_12 | 0.307650 |
| 3 | R_HEALTH_SAVINGS | 0.288540 |
| 4 | R_TRAVEL | 0.252120 |
| 5 | R_HEALTH | 0.246770 |
| 6 | R_TRAVEL_INCOME | 0.226208 |
| 7 | R_CLOTHING_INCOME | 0.220396 |
| 8 | T_TRAVEL_6 | 0.213178 |
| 9 | R_CLOTHING | 0.212854 |
==== PC9 ====
| Variable | Loading Value | |
|---|---|---|
| 0 | R_EDUCATION_SAVINGS | 0.487227 |
| 1 | R_EDUCATION_DEBT | 0.420181 |
| 2 | R_EDUCATION_INCOME | 0.345373 |
| 3 | T_EDUCATION_12 | 0.265766 |
| 4 | T_EDUCATION_6 | 0.265618 |
| 5 | R_HEALTH_SAVINGS | 0.192148 |
| 6 | R_HOUSING_SAVINGS | 0.159801 |
| 7 | R_CLOTHING_INCOME | 0.155300 |
| 8 | R_UTILITIES_SAVINGS | 0.151580 |
| 9 | R_ENTERTAINMENT_INCOME | 0.136639 |
==== PC10 ====
| Variable | Loading Value | |
|---|---|---|
| 0 | R_HOUSING_DEBT | 0.463402 |
| 1 | R_HOUSING_SAVINGS | 0.408819 |
| 2 | R_HOUSING_INCOME | 0.386008 |
| 3 | T_HOUSING_12 | 0.346977 |
| 4 | T_HOUSING_6 | 0.346896 |
| 5 | R_UTILITIES_INCOME | 0.208940 |
| 6 | R_TRAVEL_INCOME | 0.156058 |
| 7 | CAT_MORTGAGE | 0.147646 |
| 8 | R_GROCERIES_INCOME | 0.121152 |
| 9 | R_HEALTH_INCOME | 0.111284 |
==== PC11 ====
| Variable | Loading Value | |
|---|---|---|
| 0 | R_HEALTH_SAVINGS | 0.453949 |
| 1 | R_ENTERTAINMENT_INCOME | 0.268149 |
| 2 | R_UTILITIES_SAVINGS | 0.259457 |
| 3 | R_HEALTH_INCOME | 0.218178 |
| 4 | T_TRAVEL_12 | 0.214122 |
| 5 | R_TAX_INCOME | 0.200787 |
| 6 | R_EXPENDITURE | 0.184827 |
| 7 | R_HEALTH_DEBT | 0.175331 |
| 8 | T_TRAVEL_6 | 0.171943 |
| 9 | R_TRAVEL_DEBT | 0.155098 |
==== PC12 ====
| Variable | Loading Value | |
|---|---|---|
| 0 | R_EXPENDITURE_INCOME | 0.385980 |
| 1 | R_CLOTHING_INCOME | 0.305621 |
| 2 | R_GROCERIES_SAVINGS | 0.296337 |
| 3 | R_EDUCATION | 0.281494 |
| 4 | R_GROCERIES_INCOME | 0.251981 |
| 5 | R_FINES_DEBT | 0.212099 |
| 6 | R_ENTERTAINMENT_SAVINGS | 0.192833 |
| 7 | T_GROCERIES_6 | 0.164359 |
| 8 | T_GROCERIES_12 | 0.162690 |
| 9 | R_EDUCATION_INCOME | 0.161863 |
==== PC13 ====
| Variable | Loading Value | |
|---|---|---|
| 0 | R_GAMBLING | 0.750523 |
| 1 | R_EDUCATION | 0.514478 |
| 2 | T_FINES_6 | 0.146086 |
| 3 | R_FINES_DEBT | 0.130870 |
| 4 | R_ENTERTAINMENT | 0.127652 |
| 5 | R_EXPENDITURE_INCOME | 0.110343 |
| 6 | R_FINES | 0.109316 |
| 7 | R_ENTERTAINMENT_INCOME | 0.085537 |
| 8 | R_HEALTH | 0.081543 |
| 9 | R_TAX_DEBT | 0.077120 |
==== PC14 ====
| Variable | Loading Value | |
|---|---|---|
| 0 | R_EDUCATION | 0.506591 |
| 1 | R_GAMBLING | 0.390975 |
| 2 | R_CLOTHING_INCOME | 0.342160 |
| 3 | R_EXPENDITURE_INCOME | 0.271218 |
| 4 | R_GROCERIES_SAVINGS | 0.255939 |
| 5 | R_FINES | 0.206527 |
| 6 | R_TAX_DEBT | 0.171493 |
| 7 | R_FINES_DEBT | 0.141373 |
| 8 | R_CLOTHING_DEBT | 0.138850 |
| 9 | R_SAVINGS_INCOME | 0.122553 |
==== PC15 ====
| Variable | Loading Value | |
|---|---|---|
| 0 | R_ENTERTAINMENT | 0.430933 |
| 1 | R_EDUCATION | 0.405239 |
| 2 | R_FINES | 0.317757 |
| 3 | R_EXPENDITURE_INCOME | 0.317470 |
| 4 | R_ENTERTAINMENT_INCOME | 0.252775 |
| 5 | R_TAX_DEBT | 0.230853 |
| 6 | R_GROCERIES | 0.197504 |
| 7 | T_ENTERTAINMENT_6 | 0.180592 |
| 8 | R_GROCERIES_SAVINGS | 0.176899 |
| 9 | R_HEALTH_INCOME | 0.168013 |
==== PC16 ====
| Variable | Loading Value | |
|---|---|---|
| 0 | R_FINES | 0.887295 |
| 1 | R_EDUCATION | 0.343998 |
| 2 | R_ENTERTAINMENT | 0.144798 |
| 3 | R_FINES_SAVINGS | 0.114536 |
| 4 | R_ENTERTAINMENT_INCOME | 0.086826 |
| 5 | R_FINES_DEBT | 0.066263 |
| 6 | R_CLOTHING_INCOME | 0.064834 |
| 7 | R_EDUCATION_SAVINGS | 0.063090 |
| 8 | T_ENTERTAINMENT_6 | 0.062206 |
| 9 | T_FINES_12 | 0.062132 |
==== PC17 ====
| Variable | Loading Value | |
|---|---|---|
| 0 | R_GAMBLING | 0.418978 |
| 1 | R_EXPENDITURE_INCOME | 0.374325 |
| 2 | R_ENTERTAINMENT | 0.361073 |
| 3 | R_CLOTHING_INCOME | 0.324506 |
| 4 | R_TRAVEL | 0.273148 |
| 5 | R_EDUCATION | 0.261715 |
| 6 | R_EXPENDITURE | 0.195481 |
| 7 | R_HEALTH_SAVINGS | 0.167330 |
| 8 | R_GROCERIES | 0.157019 |
| 9 | R_HEALTH | 0.155118 |
==== PC18 ====
| Variable | Loading Value | |
|---|---|---|
| 0 | R_GROCERIES | 0.616102 |
| 1 | R_TRAVEL | 0.459643 |
| 2 | R_HEALTH | 0.377438 |
| 3 | R_ENTERTAINMENT | 0.185389 |
| 4 | R_CLOTHING | 0.185278 |
| 5 | R_EXPENDITURE_INCOME | 0.153569 |
| 6 | R_UTILITIES_INCOME | 0.146982 |
| 7 | R_GAMBLING | 0.127939 |
| 8 | R_FINES_DEBT | 0.114547 |
| 9 | R_TAX_INCOME | 0.113939 |
# PCA 2D Scatter Plot by Credit Category
plt.figure(figsize=(10,7))
plt.scatter(x_pca[:, 0], x_pca[:, 1], c=y_seg.map({'Poor':0,'Fair':1,'Good':2,'Excellent':3}), cmap='viridis')
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA 2D Scatter Plot by Credit Category")
plt.colorbar()
plt.show()
When points are colored by credit category, as in the plot above, no clear separation between categories is observed. All categories largely overlap in the PCA space, indicating that the main components capture general financial behavior rather than the predefined credit classes.
# PCA 2D Scatter by DEFAULT
plt.figure(figsize=(10,7))
plt.scatter(x_pca[:,0], x_pca[:,1], c=y_class, cmap='coolwarm', alpha=0.7)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA 2D Scatter by DEFAULT")
plt.colorbar()
plt.show()
When coloring by DEFAULT status in the plot above, defaulters and non-defaulters are also highly mixed. There is no clear boundary separating the two groups in the PC1–PC2 plane. This confirms that PCA, as an unsupervised method, does not directly discriminate default risk.
# PCA 2D Scatter by CREDIT_SCORE
plt.figure(figsize=(10,7))
plt.scatter(x_pca[:,0], x_pca[:,1], c=y_reg, cmap='plasma', alpha=0.7)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA 2D Scatter by CREDIT_SCORE")
plt.colorbar()
plt.show()
When points are colored by CREDIT_SCORE in the plot above, a gradual color transition can be observed rather than sharp clusters. Higher and lower credit scores are spread across the space, with slight local trends but no clear linear separation.
Overall, the PCA plots (PC1 vs PC2) show that the four credit categories, default status, and credit score all overlap heavily, with no clear clusters or regression line. This means the main directions of variation in the data are driven by spending and saving behavior and do not align with the credit categories.
Since PCA is unsupervised and focuses on overall variance, it is normal that the credit groups are not visually separated in the first two components.
With the PCA-transformed data, we can also tackle a new task: Customer Financial Behavior Segmentation.
df_pca = pd.DataFrame(x_pca, columns=[f'PC{i+1}' for i in range(n_components)])
df_pca["DEFAULT"] = y_class
df_pca["CREDIT_SCORE"] = y_reg
df_pca["CATEGORY"] = y_seg
df_pca.head()
| PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | ... | PC12 | PC13 | PC14 | PC15 | PC16 | PC17 | PC18 | DEFAULT | CREDIT_SCORE | CATEGORY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.263568 | 1.675703 | 1.616015 | -1.777544 | 0.900540 | 0.361975 | -2.571286 | -0.326115 | 0.336299 | 0.484445 | ... | -1.161316 | -0.392118 | -0.575822 | 0.026258 | 1.003179 | -0.838064 | 0.592423 | 1 | 444.0 | Poor |
| 1 | -1.113713 | 0.097301 | -0.083219 | -1.488284 | -0.278663 | -1.442984 | 1.640455 | -2.546311 | -2.071634 | 1.182105 | ... | 0.090272 | 0.011836 | -1.387963 | -1.021540 | 0.263893 | -0.475882 | 0.268768 | 0 | 625.0 | Good |
| 2 | 0.244502 | 6.755585 | -0.368538 | 1.002296 | 4.414212 | 1.369341 | -3.355328 | 0.337632 | -1.563512 | 0.385298 | ... | 0.491544 | -0.693706 | -0.040053 | 0.396325 | 0.708798 | 0.652537 | 1.179665 | 1 | 469.0 | Poor |
| 3 | 2.228464 | 5.653206 | 2.395674 | 0.698185 | -1.647916 | 0.834930 | 0.376260 | -0.190960 | 0.288253 | 1.457348 | ... | -0.885084 | 0.668575 | -0.342176 | 0.983585 | -1.665101 | -0.285464 | 0.304583 | 0 | 559.0 | Fair |
| 4 | 1.300493 | -2.019472 | -2.782794 | 0.626834 | -0.635844 | -2.531894 | -2.460998 | -1.658941 | 0.163559 | 0.804533 | ... | 0.938645 | 0.124539 | 0.018004 | 0.481896 | -0.458401 | -0.139912 | -0.680096 | 0 | 473.0 | Poor |
5 rows × 21 columns
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans
# Use all retained principal components for clustering
X_cluster = df_pca[[f'PC{i+1}' for i in range(n_components)]]
sil_scores = []
K_range = range(2, 10)
for k in K_range:
kmeans = KMeans(n_clusters=k, random_state=42)
labels = kmeans.fit_predict(X_cluster)
sil = silhouette_score(X_cluster, labels)
sil_scores.append(sil)
print(f"K={k} Silhouette Score = {sil:.4f}")
plt.figure(figsize=(6,4))
plt.plot(K_range, sil_scores, marker='o')
plt.xlabel("Number of clusters (K)")
plt.ylabel("Silhouette Score")
plt.title("Silhouette Analysis")
plt.grid(True)
plt.show()
K=2 Silhouette Score = 0.1493
K=3 Silhouette Score = 0.1649
K=4 Silhouette Score = 0.1651
K=5 Silhouette Score = 0.1705
K=6 Silhouette Score = 0.1833
K=7 Silhouette Score = 0.1672
K=8 Silhouette Score = 0.1787
K=9 Silhouette Score = 0.1664
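The best K can be picked programmatically instead of read off the plot; a small sketch reusing the silhouette scores printed above:

```python
import numpy as np

# Silhouette scores as printed above, for K = 2..9.
K_range = range(2, 10)
sil_scores = [0.1493, 0.1649, 0.1651, 0.1705, 0.1833, 0.1672, 0.1787, 0.1664]

# argmax indexes into sil_scores; map that position back to its K value.
best_k = K_range[int(np.argmax(sil_scores))]
print(f"Best K by silhouette: {best_k}")
```

This is why K=6 is used for the final KMeans fit below, even though all scores are fairly low (well under the ~0.5 usually taken to indicate strong cluster structure).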
from sklearn.cluster import KMeans
# Use all retained principal components for clustering
X_cluster = df_pca[[f'PC{i+1}' for i in range(n_components)]]
kmeans = KMeans(n_clusters=6, random_state=42)
clusters = kmeans.fit_predict(X_cluster)
# add cluster labels
df_pca['Cluster'] = clusters
df_pca['Cluster'].unique()
array([2, 5, 3, 0, 4, 1], dtype=int32)
df_pca.head()
| PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | ... | PC13 | PC14 | PC15 | PC16 | PC17 | PC18 | DEFAULT | CREDIT_SCORE | CATEGORY | Cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.263568 | 1.675703 | 1.616015 | -1.777544 | 0.900540 | 0.361975 | -2.571286 | -0.326115 | 0.336299 | 0.484445 | ... | -0.392118 | -0.575822 | 0.026258 | 1.003179 | -0.838064 | 0.592423 | 1 | 444.0 | Poor | 2 |
| 1 | -1.113713 | 0.097301 | -0.083219 | -1.488284 | -0.278663 | -1.442984 | 1.640455 | -2.546311 | -2.071634 | 1.182105 | ... | 0.011836 | -1.387963 | -1.021540 | 0.263893 | -0.475882 | 0.268768 | 0 | 625.0 | Good | 2 |
| 2 | 0.244502 | 6.755585 | -0.368538 | 1.002296 | 4.414212 | 1.369341 | -3.355328 | 0.337632 | -1.563512 | 0.385298 | ... | -0.693706 | -0.040053 | 0.396325 | 0.708798 | 0.652537 | 1.179665 | 1 | 469.0 | Poor | 5 |
| 3 | 2.228464 | 5.653206 | 2.395674 | 0.698185 | -1.647916 | 0.834930 | 0.376260 | -0.190960 | 0.288253 | 1.457348 | ... | 0.668575 | -0.342176 | 0.983585 | -1.665101 | -0.285464 | 0.304583 | 0 | 559.0 | Fair | 3 |
| 4 | 1.300493 | -2.019472 | -2.782794 | 0.626834 | -0.635844 | -2.531894 | -2.460998 | -1.658941 | 0.163559 | 0.804533 | ... | 0.124539 | 0.018004 | 0.481896 | -0.458401 | -0.139912 | -0.680096 | 0 | 473.0 | Poor | 0 |
5 rows × 22 columns
cluster_profile = df_pca.groupby('Cluster')[ [f'PC{i+1}' for i in range(n_components)] ].mean().round(2)
cluster_profile
| Cluster | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | PC11 | PC12 | PC13 | PC14 | PC15 | PC16 | PC17 | PC18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.84 | -1.58 | -0.18 | -1.18 | -0.17 | -0.51 | 0.07 | -0.03 | 0.10 | 0.03 | -0.09 | 0.06 | 0.07 | 0.04 | -0.07 | -0.02 | -0.04 | -0.02 |
| 1 | -1.29 | 5.18 | 0.58 | 1.40 | -1.56 | -0.75 | -1.47 | 0.48 | 0.42 | -0.34 | 0.52 | -0.31 | -0.10 | 0.04 | -0.09 | 0.11 | 0.21 | -0.03 |
| 2 | -3.44 | 1.33 | 0.09 | -2.11 | -0.59 | -0.08 | -0.07 | 0.01 | -0.23 | 0.09 | -0.20 | 0.12 | -0.01 | 0.14 | -0.10 | -0.01 | -0.04 | -0.02 |
| 3 | 3.19 | 2.48 | 0.05 | 1.77 | -1.25 | 0.49 | 0.40 | -0.32 | -0.12 | 0.01 | -0.30 | -0.03 | 0.03 | 0.03 | 0.02 | -0.06 | -0.03 | 0.02 |
| 4 | -2.26 | -2.51 | 0.63 | 1.94 | 0.06 | 0.51 | 0.11 | 0.17 | -0.03 | -0.08 | 0.39 | -0.10 | -0.07 | -0.19 | 0.20 | 0.03 | 0.07 | 0.05 |
| 5 | 0.56 | 2.51 | -1.90 | 0.21 | 6.44 | 0.64 | -0.30 | -0.04 | 0.11 | 0.12 | 0.07 | -0.01 | -0.10 | -0.10 | -0.03 | 0.10 | 0.01 | -0.01 |
Cluster 0 — Financially Stable, Efficient Savers
Cluster 1 — Moderate Lifestyle Spend + Weak Savings
Cluster 2 — High Lifestyle Cost + Behavioral Risk
Cluster 3 — High Spending + Stress from Gambling/Education/Health
plt.figure(figsize=(10, 8))
sns.scatterplot(
x = 'PC1'
, y = 'PC2'
, hue = 'Cluster'
, data = df_pca
, palette = 'viridis'
, s = 70
, alpha = 0.7
, legend= 'full'
)
plt.title("Scatter Plot of PC1 vs PC2 (After Clustering)")
# Select the first n Principal Components
pc_cols = [f'PC{i+1}' for i in range(n_components)]
cluster_profile_plot = cluster_profile[pc_cols]
# Prepare data for plotting
cluster_profile_melted = cluster_profile_plot.reset_index().melt(
id_vars = 'Cluster'
, var_name = 'Principal Component'
, value_name= 'Mean Score'
)
plt.figure(figsize=(12, 6))
sns.barplot(
x = 'Principal Component'
, y = 'Mean Score'
, hue ='Cluster'
, data = cluster_profile_melted
, palette= 'viridis'
)
plt.title('Cluster Profile: Mean Principal Component Scores (Centroids)')
plt.ylabel('Mean PC Score')
plt.xlabel('Principal Component')
plt.axhline(0, color='black', linestyle='--', linewidth=1) # Reference line at 0
plt.legend(title='Cluster ID')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.savefig(f"{figure_dir}/cluster_profile_bar_chart.png")
plt.show()
# Feature engineering
# --- 1. Drop non-feature identifier columns ---
if 'CUST_ID' in df.columns:
    df = df.drop(columns=['CUST_ID'])
# --- 2. Integrate cluster labels as one-hot features ---
# Note: the clustering step above stored labels in df_pca['Cluster'], so this
# branch is skipped unless the labels are copied into df as 'Cluster_ID'.
if 'Cluster_ID' in df.columns:
    # One-hot encode the cluster ID column
    cluster_dummies = pd.get_dummies(df['Cluster_ID'], prefix='CLUSTER', drop_first=True)
    df = pd.concat([df.drop('Cluster_ID', axis=1), cluster_dummies], axis=1)
    print("Cluster IDs successfully integrated and one-hot encoded.")
else:
    print("Warning: 'Cluster_ID' not found. Skipping cluster integration.")
Warning: 'Cluster_ID' not found. Skipping cluster integration.
Analysis shows:
Strategy: Create higher-order features to capture:
While PCA focuses on variance maximization within a single set of variables, it does not explicitly measure relationships between different groups of variables. To analyze the dependency structure between financial indicators and spending behavior variables, we apply Canonical Correlation Analysis (CCA). CCA finds linear relationships between two sets of variables.
Here we analyze:
Group 1: Financial status variables
Group 2: Spending behavior variables
CCA seeks linear combinations of two variable blocks that are maximally correlated.
Data Matrices
Let
$$ X \in \mathbb{R}^{n \times p} $$
denote the matrix of financial variables (income, savings, debt, financial ratios), and
$$ Y \in \mathbb{R}^{n \times q} $$
denote the matrix of spending-related variables (housing, gambling, leisure, health, travel, etc.).
Both matrices are centered and standardized.
Canonical Variates
CCA aims to find vectors $a \in \mathbb{R}^p$ and $b \in \mathbb{R}^q$ such that the linear projections
$$ u = Xa, \qquad v = Yb $$
are maximally correlated.
Optimization Problem
The canonical correlation is obtained by solving:
$$ \max_{a,b} \; \mathrm{corr}(Xa, Yb) = \frac{a^\top \Sigma_{XY} b} {\sqrt{a^\top \Sigma_{XX} a \; b^\top \Sigma_{YY} b}}, $$
subject to the constraints:
$$ a^\top \Sigma_{XX} a = 1, \qquad b^\top \Sigma_{YY} b = 1, $$
where $\Sigma_{XX}$ and $\Sigma_{YY}$ are within-set covariance matrices and
$\Sigma_{XY}$ is the cross-covariance matrix.
Generalized Eigenvalue Problems
The optimization leads to the following eigenvalue equations:
$$ \Sigma_{XX}^{-1} \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{YX} a = \rho^2 a, $$
$$ \Sigma_{YY}^{-1} \Sigma_{YX} \Sigma_{XX}^{-1} \Sigma_{XY} b = \rho^2 b, $$
where $\rho$ is the canonical correlation coefficient.
# Group 1 — Financial condition
X_financial = df[["INCOME", "SAVINGS", "DEBT", "R_SAVINGS_INCOME", "R_DEBT_INCOME"]].copy()
# Group 2 — Behavioral spending variables (use key 12-month expenditures)
spending_cols = [
"T_GROCERIES_12"
, "T_CLOTHING_12"
, "T_EDUCATION_12"
, "T_HEALTH_12"
, "T_TRAVEL_12"
, "T_ENTERTAINMENT_12"
, "T_GAMBLING_12"
, "T_HOUSING_12"
]
X_spending = df[spending_cols].copy()
# Scale both groups
sc1 = StandardScaler()
sc2 = StandardScaler()
X1 = sc1.fit_transform(X_financial)
X2 = sc2.fit_transform(X_spending)
# Fit CCA
cca = CCA(n_components=2)
X1_c, X2_c = cca.fit_transform(X1, X2)
rho1 = np.corrcoef(X1_c[:, 0], X2_c[:, 0])[0, 1]
rho2 = np.corrcoef(X1_c[:, 1], X2_c[:, 1])[0, 1]
print("Canonical correlations:", rho1, rho2)
# Visualization
plt.figure(figsize=(8,6))
plt.scatter(X1_c[:,0], X2_c[:,0], alpha=0.6, c=df["DEFAULT"], cmap="coolwarm")
plt.xlabel("Canonical Variable 1 — Financial")
plt.ylabel("Canonical Variable 1 — Spending Behavior")
plt.title("CCA: Relationship Between Financial Status and Spending Behavior")
plt.colorbar(label="DEFAULT")
plt.grid(True)
plt.show()
Canonical correlations: 0.9914443873851756 0.7537951088810095
Canonical Correlation Analysis (CCA) shows an extremely strong association between financial indicators (income, savings, debt ratios) and spending behavior. The first canonical correlation is 0.991, indicating almost perfect alignment between the two multivariate structures. A second independent dimension $(\rho = 0.754)$ reveals a secondary but still strong financial–behavior pattern. Overall, financial status and spending patterns are highly coupled in this dataset, and even defaulting customers follow the same structure, only shifted toward the lower end of financial strength.
Principal Component Analysis (PCA) is designed for continuous numerical variables and is therefore not suitable for studying relationships between categorical variables. To analyze associations between categorical attributes and default behavior, we apply Factorial Correspondence Analysis (FCA), also known as Correspondence Analysis, which analyzes the relationship between two categorical variables. Here we study the association of credit category and gambling behavior with default.
FCA is based on the analysis of a contingency table constructed from two categorical variables.
In this study, it is applied to cross-tabulations such as CREDIT_CATEGORY vs DEFAULT and
CAT_GAMBLING vs DEFAULT to visualize and quantify their dependency structure.
Contingency Table
Let $N = (n_{ij})$ denote the contingency table with $I$ row categories and $J$ column categories, and let
$$ n = \sum_{i=1}^{I} \sum_{j=1}^{J} n_{ij}. $$
The matrix of relative frequencies is defined as:
$$ P = \frac{1}{n} N. $$
Row and Column Masses
The row and column marginal distributions (masses) are given by:
$$ r_i = \sum_{j=1}^{J} p_{ij}, \qquad c_j = \sum_{i=1}^{I} p_{ij}. $$
Let $r = (r_1, \dots, r_I)^\top$ and $c = (c_1, \dots, c_J)^\top$, and define the diagonal matrices:
$$ D_r = \mathrm{diag}(r_1, \dots, r_I), \qquad D_c = \mathrm{diag}(c_1, \dots, c_J). $$
Independence Model and Standardized Residuals
Under the hypothesis of independence between the two categorical variables, the expected frequency matrix is:
$$ r c^\top. $$
FCA analyzes deviations from this independence model using the standardized residual matrix:
$$ S = D_r^{-1/2} (P - r c^\top) D_c^{-1/2}. $$
Singular Value Decomposition
A Singular Value Decomposition (SVD) of $S$ is performed:
$$ S = U \Delta V^\top, $$
where $\Delta = \mathrm{diag}(\delta_1, \delta_2, \dots)$ contains the singular values.
The associated eigenvalues
$$ \lambda_k = \delta_k^2 $$
represent the inertia explained by each factorial axis.
Principal Coordinates
The principal coordinates of the row and column categories are respectively given by:
$$ F = D_r^{-1/2} U \Delta, \qquad G = D_c^{-1/2} V \Delta. $$
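The SVD recipe above can be reproduced directly in NumPy. The contingency table below is illustrative, not taken from the dataset:

```python
import numpy as np

# Toy contingency table (rows: three categories, cols: DEFAULT 0/1)
N = np.array([[90, 10],
              [60, 40],
              [30, 70]], dtype=float)

P = N / N.sum()                      # relative frequencies
r = P.sum(axis=1)                    # row masses
c = P.sum(axis=0)                    # column masses
Dr_inv_sqrt = np.diag(1 / np.sqrt(r))
Dc_inv_sqrt = np.diag(1 / np.sqrt(c))

# Standardized residuals from the independence model r c^T
S = Dr_inv_sqrt @ (P - np.outer(r, c)) @ Dc_inv_sqrt

U, delta, Vt = np.linalg.svd(S, full_matrices=False)

# Principal coordinates of rows (F) and columns (G)
F = (Dr_inv_sqrt @ U) * delta
G = (Dc_inv_sqrt @ Vt.T) * delta

inertia = delta**2                   # inertia explained per factorial axis
```

A useful check on this construction: the total inertia times the sample size equals the Pearson chi-square statistic of the table, linking FCA directly to the classical test of independence.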
# 1. CREDIT_CATEGORY vs DEFAULT
ct_credit = pd.crosstab(df["CREDIT_CATEGORY"], df["DEFAULT"])
ca_credit = prince.CA(n_components=2, random_state=42)
ca_credit = ca_credit.fit(ct_credit)
rows_credit = ca_credit.row_coordinates(ct_credit)
cols_credit = ca_credit.column_coordinates(ct_credit)
print(rows_credit)
print(cols_credit)
plt.figure(figsize=(10, 2))
# Plot rows (categories)
plt.scatter(rows_credit[0], [0]*len(rows_credit), color="blue", label="CREDIT_CATEGORY")
for i, label in enumerate(rows_credit.index):
plt.text(rows_credit.iloc[i, 0], 0.02, label, color="blue")
# Plot columns (DEFAULT)
plt.scatter(cols_credit[0], [0]*len(cols_credit), color="red", label="DEFAULT")
for i, label in enumerate(cols_credit.index):
plt.text(cols_credit.iloc[i, 0], -0.02, label, color="red")
plt.axvline(0, color="black", linewidth=0.5)
plt.yticks([])
plt.xlabel("FCA Dimension 1")
plt.title("1D Correspondence Analysis: CREDIT_CATEGORY vs DEFAULT")
plt.legend()
plt.savefig(f"{figure_dir}/fca_analysis_credit_category.png")
plt.show()
0
CREDIT_CATEGORY
Excellent -0.629800
Fair 0.087361
Good -0.218432
Poor 0.848604
0
DEFAULT
0 -0.189525
1 0.477818
The resulting coordinates are:
| Category | FCA Coordinate | Interpretation |
|---|---|---|
| Excellent | −0.6298 | Strongly aligned with non-default |
| Good | −0.2184 | Slightly aligned with non-default |
| Fair | +0.0873 | Slightly leaning toward risk |
| Poor | +0.8486 | Strongly aligned with default |
We can interpret this as follows: FCA confirms a monotonic relationship. As the credit category worsens (Excellent → Poor), the likelihood of default increases, with "Poor" customers strongly associated with DEFAULT = 1 and "Excellent" customers with DEFAULT = 0.
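The visual association can be backed by a formal chi-square test of independence using `chi2_contingency` (already imported above). The counts below are illustrative; the real crosstab would come from `pd.crosstab(df["CREDIT_CATEGORY"], df["DEFAULT"])`:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Illustrative CREDIT_CATEGORY x DEFAULT crosstab (made-up counts)
ct = np.array([
    [110,  10],   # Excellent
    [240,  60],   # Good
    [210,  90],   # Fair
    [156, 124],   # Poor
])

chi2_stat, p_value, dof, expected = chi2_contingency(ct)
print(f"chi2 = {chi2_stat:.2f}, dof = {dof}, p = {p_value:.2e}")
# A small p-value rejects independence, corroborating the FCA pattern
```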
# 2. CAT_GAMBLING vs DEFAULT
ct_credit = pd.crosstab(df["CAT_GAMBLING"], df["DEFAULT"])
ca_credit = prince.CA(n_components=2, random_state=42)
ca_credit = ca_credit.fit(ct_credit)
rows_credit = ca_credit.row_coordinates(ct_credit)
cols_credit = ca_credit.column_coordinates(ct_credit)
print(rows_credit)
print(cols_credit)
plt.figure(figsize=(10, 2))
# Plot rows (categories)
plt.scatter(rows_credit[0], [0]*len(rows_credit), color="blue", label="CAT_GAMBLING")
for i, label in enumerate(rows_credit.index):
plt.text(rows_credit.iloc[i, 0], 0.02, label, color="blue")
# Plot columns (DEFAULT)
plt.scatter(cols_credit[0], [0]*len(cols_credit), color="red", label="DEFAULT")
for i, label in enumerate(cols_credit.index):
plt.text(cols_credit.iloc[i, 0], -0.02, label, color="red")
plt.axvline(0, color="black", linewidth=0.5)
plt.yticks([])
plt.xlabel("FCA Dimension 1")
plt.title("1D Correspondence Analysis: CAT_GAMBLING vs DEFAULT")
plt.legend()
plt.savefig(f"{figure_dir}/fca_analysis_gambling.png")
plt.show()
0
CAT_GAMBLING
0 -0.036054
1 0.001071
2 0.084202
0
DEFAULT
0 -0.032591
1 0.082165
The resulting coordinates for CAT_GAMBLING:
| CAT_GAMBLING | Coordinate |
|---|---|
| 0 = No gambling | −0.036 |
| 1 = Low gambling | +0.001 |
| 2 = High gambling | +0.084 |
and for DEFAULT:
| DEFAULT | Coordinate |
|---|---|
| 0 = No default | −0.033 |
| 1 = Default | +0.082 |
FCA reveals a clear positive relationship between gambling intensity and default risk: high-gambling customers lie closest to the default category, while non-gamblers align with the non-default category. Low-gambling customers occupy an intermediate position, showing only a weak association.
# Define Features (X) and Targets (y)
y_classification = df['DEFAULT']
y_regression = df['CREDIT_SCORE']
#Features (X)
X = df.drop(columns=['DEFAULT', 'CREDIT_SCORE', 'CREDIT_CATEGORY' ])
# Scale the Features
scaler = StandardScaler()
# Fit and transform the features
X_scale = scaler.fit_transform(X)
X_scaled_df = pd.DataFrame(X_scale, columns=X.columns)
print(f"\nFinal feature set shape: {X_scaled_df.shape}")
print(f"Classification Target distribution:\n{y_classification.value_counts(normalize=True)}")
X_scaled_df
Final feature set shape: (1000, 84)
Classification Target distribution:
DEFAULT
0    0.716
1    0.284
Name: proportion, dtype: float64
| | INCOME | SAVINGS | DEBT | R_SAVINGS_INCOME | R_DEBT_INCOME | R_DEBT_SAVINGS | T_CLOTHING_12 | T_CLOTHING_6 | R_CLOTHING | R_CLOTHING_INCOME | ... | R_EXPENDITURE | R_EXPENDITURE_INCOME | R_EXPENDITURE_SAVINGS | R_EXPENDITURE_DEBT | CAT_GAMBLING | CAT_DEBT | CAT_CREDIT_CARD | CAT_MORTGAGE | CAT_SAVINGS_ACCOUNT | CAT_DEPENDENTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.641653 | -2.221179 | 0.272084 | -1.901185 | 1.571900 | -0.258144 | -0.446605 | -0.173849 | 0.192660 | 0.316520 | ... | 0.842411 | 0.504703 | -1.773710 | -1.087520 | 1.558246 | 0.243561 | -0.555788 | -0.457373 | -11.91038 | -0.420084 |
| 1 | -0.054784 | -0.586871 | -0.027286 | -0.704448 | -0.057857 | 0.518035 | 0.287534 | -1.022509 | -1.847030 | 0.826975 | ... | -1.939268 | -0.082071 | 0.679491 | -0.166360 | -0.740052 | 0.243561 | -0.555788 | -0.457373 | 0.08396 | -0.420084 |
| 2 | -0.685076 | -1.192847 | 0.275012 | -1.071609 | 1.684639 | 1.719951 | -0.697706 | -0.219659 | 1.222678 | -0.391910 | ... | 0.718195 | 0.504703 | 1.268266 | -1.122332 | 1.558246 | 0.243561 | -0.555788 | -0.457373 | 0.08396 | -0.420084 |
| 3 | -0.018782 | -0.758162 | 0.376122 | -0.986246 | 0.634056 | 1.224568 | 0.416664 | 0.597256 | 0.350766 | 1.040342 | ... | -0.176409 | 0.504703 | 1.155687 | -0.652750 | 1.558246 | 0.243561 | -0.555788 | -0.457373 | 0.08396 | -0.420084 |
| 4 | 0.547953 | 1.448408 | 1.381167 | 1.108251 | 1.571900 | 0.126270 | -0.421101 | -0.647558 | -1.237920 | -1.628593 | ... | -0.654451 | -0.082071 | -1.084406 | -1.129809 | 1.558246 | 0.243561 | 1.799247 | 2.186400 | 0.08396 | 2.380476 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 1.466869 | 1.714589 | 1.552338 | 0.528485 | 1.635792 | 0.577909 | 1.236307 | 1.327408 | 0.643664 | 0.119182 | ... | 0.084570 | -0.082071 | -0.684209 | -1.148612 | 1.558246 | 0.243561 | 1.799247 | 2.186400 | 0.08396 | 2.380476 |
| 996 | -0.011228 | -0.600640 | 0.426788 | -0.766068 | 0.717376 | 1.073535 | 0.230975 | 0.205494 | -0.408395 | 0.594839 | ... | -1.096107 | -0.082071 | 0.763783 | -0.772383 | -0.740052 | 0.243561 | -0.555788 | -0.457373 | 0.08396 | -0.420084 |
| 997 | -2.309647 | -0.942606 | -0.982664 | 0.206141 | 0.691980 | -0.579489 | -2.111023 | -1.979688 | 1.793215 | -2.186079 | ... | 2.887578 | 0.877940 | -0.505927 | 0.299522 | -0.740052 | 0.243561 | -0.555788 | -0.457373 | 0.08396 | -0.420084 |
| 998 | -0.593478 | -1.478950 | 0.350547 | -1.579451 | 1.640142 | 2.141093 | -0.416883 | -0.023856 | 0.775066 | 0.268842 | ... | 1.726977 | 1.102285 | 1.877274 | -1.060494 | -0.740052 | 0.243561 | 1.799247 | -0.457373 | 0.08396 | -0.420084 |
| 999 | -0.461468 | 0.190279 | -0.866535 | 0.994166 | -1.079319 | -1.199065 | -0.544323 | -0.026735 | 1.475732 | -0.468859 | ... | -0.483811 | 1.102285 | -0.880463 | 1.460234 | -0.740052 | 0.243561 | -0.555788 | -0.457373 | 0.08396 | -0.420084 |
1000 rows × 84 columns
# Count DEFAULT values
default_counts = df['DEFAULT'].value_counts().sort_index()
# Plot
plt.figure(figsize=(6, 4))
sns.barplot(
    x=default_counts.index,
    y=default_counts.values,
    hue=default_counts.index,
    palette=['steelblue', 'tomato'],
    legend=False
)
# Labels and title
plt.xlabel("DEFAULT (0 = Non-Default, 1 = Default)")
plt.ylabel("Number of Samples")
plt.title("Class Balance of DEFAULT Variable")
# Add percentage labels on bars
total = default_counts.sum()
for i, count in enumerate(default_counts.values):
percentage = count / total * 100
plt.text(i, count, f"{percentage:.1f}%", ha='center', va='bottom')
plt.tight_layout()
plt.show()
Problem Definition
The objective of this project is to predict the binary target variable DEFAULT, which indicates whether a client defaults on their credit repayment.
Formally, the task is a supervised binary classification problem defined as:
$$ y_i \in \{0,1\}, \quad i = 1, \dots, n, $$
where $y_i = 1$ indicates that client $i$ defaults and $y_i = 0$ that they do not.
Logistic Regression was used as a baseline statistical model due to its interpretability and strong theoretical foundations.
The logistic regression model estimates the conditional probability:
$$ P(y = 1 \mid \mathbf{X}) = \sigma(\beta_0 + \mathbf{X}^\top \boldsymbol{\beta}), $$
where
$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$
is the logistic (sigmoid) function and $\boldsymbol{\beta} \in \mathbb{R}^p$ is the vector of regression coefficients.
The model parameters are estimated by minimizing the negative log-likelihood (binary cross-entropy loss):
$$ \mathcal{L}(\boldsymbol{\beta}) = - \sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \right], $$
where $p_i = P(y_i = 1 \mid \mathbf{X}_i)$.
# 1. Split Data for Classification
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(X_scaled_df, y_classification, test_size=0.3,
random_state=42, stratify=y_classification)
# 2. Implement Baseline Model --- Logistic Regression---
log_reg = LogisticRegression(class_weight='balanced', random_state=42, solver='liblinear')
log_reg.fit(X_train_c, y_train_c)
# 3. Evaluate Model
y_pred_prob_c = log_reg.predict_proba(X_test_c)[:, 1]
# Calculate AUC-ROC
auc_roc = roc_auc_score(y_test_c, y_pred_prob_c)
print("\n--- Classification Results (Logistic Regression) ---")
print(f"AUC-ROC Score on Test Set: {auc_roc:.4f}")
# Get feature importance scores
#importance_c = pd.Series(log_reg.feature_importances_, index=X_train_c.columns)
# Calculate other metrics
y_pred_c = log_reg.predict(X_test_c)
print("\nClassification Report:")
print(classification_report(y_test_c, y_pred_c))
--- Classification Results (Logistic Regression) ---
AUC-ROC Score on Test Set: 0.6306
Classification Report:
precision recall f1-score support
0 0.80 0.73 0.76 215
1 0.44 0.54 0.48 85
accuracy 0.67 300
macro avg 0.62 0.63 0.62 300
weighted avg 0.70 0.67 0.68 300
--- Interpreting Baseline Classification Results ---
Class 0 (Non-Default)
Precision = 0.80: of all customers the model predicted would not default, 80% were correct; the remaining 20% of those predicted as safe were actually defaulters.
Recall = 0.73: of all customers who actually did not default, 73% were correctly identified, meaning the model incorrectly flagged 27% of truly safe customers as defaulters (False Positives).
F1-Score = 0.76: a moderate score, demonstrating fair but not strong performance on the majority class.
Class 1 (Default, the target risk class)
Precision = 0.44 (low): of all customers the model predicted would default, only 44% were correct, so the model generates a high number of false alarms (56% of those flagged were actually safe).
Recall = 0.54 (moderate): of all customers who actually defaulted, 54% were caught; the remaining 46% of true defaulters (high-risk individuals) were missed (False Negatives).
F1-Score = 0.48: a weak balance between catching defaulters and minimizing false alarms.
Overall performance: the low AUC-ROC of 0.6306 indicates only weak discriminative ability, motivating the more flexible models explored next.
# 1. Calculate ROC curve components (False Positive Rate and True Positive Rate)
fpr, tpr, thresholds = roc_curve(y_test_c, y_pred_prob_c)
# 2. Calculate the Area Under the Curve (AUC)
roc_auc = auc(fpr, tpr)
# 3. Plot the ROC Curve
plt.figure(figsize=(8, 6))
# Plot the ROC curve for the model
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'Logistic Regression ROC curve (AUC = {roc_auc:.4f})')
# Plot the diagonal line
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Chance (AUC = 0.50)')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR) / Recall')
plt.title('Receiver Operating Characteristic (ROC) Curve - Baseline Classification')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()
importance_log = pd.Series(
log_reg.coef_[0],
index=X_train_c.columns
).sort_values(key=abs, ascending=False)
print(importance_log)
plt.figure(figsize=(8, 6))
importance_log.head(15).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title('Logistic Regression Feature Importance (Top 15)')
plt.xlabel('Coefficient Value')
plt.grid(alpha=0.3)
plt.show()
R_GROCERIES_INCOME 0.786413
R_HEALTH_INCOME 0.681634
R_EDUCATION_SAVINGS 0.671410
T_TRAVEL_12 -0.667495
R_HEALTH_SAVINGS -0.571015
...
R_SAVINGS_INCOME 0.020393
T_HEALTH_12 -0.017926
CAT_DEPENDENTS -0.016814
R_CLOTHING_SAVINGS -0.012167
DEBT 0.002508
Length: 84, dtype: float64
Risk-increasing indicators (positive coefficients)
R_GROCERIES_INCOME (+0.79): A higher ratio of grocery expenditure to income is strongly associated with increased default risk, suggesting that individuals allocating a large portion of their income to essential expenses are more financially constrained.
R_HEALTH_INCOME (+0.68) and R_EDUCATION_SAVINGS (+0.67): Higher spending relative to income or savings in these categories also increases default probability.
Risk-reducing indicators (negative coefficients)
T_TRAVEL_12 (−0.67): Borrowers with discretionary travel expenditures are less likely to default, suggesting greater financial stability.
R_HEALTH_SAVINGS (−0.57): Higher health-related savings reduce the likelihood of default, providing a financial buffer against unexpected expenses.
Low-impact features (near-zero coefficients)
Several features, including entertainment spending, debt level, and gambling expenditure, have coefficients near zero, indicating minimal contribution to default prediction when considered alongside other variables.
Overall, the results indicate that high essential expenditure ratios are strong predictors of default risk, whereas higher savings and discretionary spending signal financial stability.
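Because the features are standardized, each coefficient can also be read as an odds ratio via exponentiation: $e^{\beta}$ is the multiplicative change in the odds of default per one-standard-deviation increase in the feature. A small sketch reusing a few coefficient values from the printout above:

```python
import numpy as np
import pandas as pd

# Coefficients copied from the fitted logistic regression printout above;
# in the notebook: pd.Series(log_reg.coef_[0], index=X.columns)
coefs = pd.Series({
    "R_GROCERIES_INCOME":  0.786413,
    "T_TRAVEL_12":        -0.667495,
    "DEBT":                0.002508,
})

# exp(beta): multiplicative change in the odds of DEFAULT per +1 SD of the feature
odds_ratios = np.exp(coefs)
print(odds_ratios.round(3))
```

An odds ratio above 1 marks a risk-increasing feature and below 1 a risk-reducing one, matching the sign-based reading above.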
Random Forest is an ensemble learning method that combines multiple decision trees to improve predictive performance and reduce overfitting. Unlike a single decision tree, which is prone to overfitting, Random Forest constructs a collection of trees, each trained on a bootstrap sample of the data and a random subset of features. This randomness increases model diversity and enhances generalization performance.
In the classification setting, each decision tree produces a class prediction or a class probability. The final prediction is obtained by aggregating predictions across all trees using majority voting or probability averaging. The model can be expressed as:
$$ \hat{p}(y=1 \mid X) = \frac{1}{M}\sum_{m=1}^M f_m(X), \qquad \hat{y} = \mathbb{I}\bigl(\hat{p}(y=1 \mid X) \ge \tau \bigr), $$
where $f_m(X)$ is the prediction of the $m$-th tree, $M$ is the number of trees, $\tau$ is the decision threshold, and the positive class is the target variable (DEFAULT in this study). Random Forest models effectively capture non-linear interactions and are robust to noise and multicollinearity, which are common characteristics of financial datasets.
# Train a Random Forest Classification
rf_c = RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced')
rf_c.fit(X_train_c, y_train_c)
# Get feature importance scores
importance_c = pd.Series(rf_c.feature_importances_, index=X_train_c.columns)
# Y prediction classification random forest
y_pred_rf = rf_c.predict(X_test_c)
y_pred_proba_rf = rf_c.predict_proba(X_test_c)[:, 1]
# Accuracy Score
accuracy = accuracy_score(y_test_c, y_pred_rf)
print(f"Random Forest Accuracy: {accuracy:.4f}")
# AUC-ROC Score
auc_roc = roc_auc_score(y_test_c, y_pred_proba_rf)
print(f"Random Forest AUC-ROC Score: {auc_roc:.4f}")
print("\nClassification Report:")
print(classification_report(y_test_c, y_pred_rf))
Random Forest Accuracy: 0.7067
Random Forest AUC-ROC Score: 0.6148
Classification Report:
precision recall f1-score support
0 0.73 0.95 0.82 215
1 0.42 0.09 0.15 85
accuracy 0.71 300
macro avg 0.57 0.52 0.49 300
weighted avg 0.64 0.71 0.63 300
# 1. Calculate ROC curve components
fpr, tpr, thresholds = roc_curve(y_test_c, y_pred_proba_rf)
# 2. Calculate the Area Under the Curve (AUC)
roc_auc = auc(fpr, tpr)
# 3. Plot the ROC Curve
plt.figure(figsize=(8, 6))
# Plot the ROC curve for the model
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'RF ROC curve (AUC = {roc_auc:.4f})')
# Plot the diagonal line
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Chance (AUC = 0.50)')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR) / Recall')
plt.title('Receiver Operating Characteristic (ROC) Curve - Random Forest')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()
def plot_feature_importance(importance_series, title, top_n=10):
    """Plots the top N features based on importance scores."""
    plt.figure(figsize=(10, 6))
    top_features = importance_series.nlargest(top_n)
    sns.barplot(x=top_features.values, y=top_features.index,
                hue=top_features.index, palette='viridis', legend=False)
    plt.title(title)
    plt.xlabel('Feature Importance Score')
    plt.ylabel('Features')
    plt.show()
# Plot the top 10 most important features
plot_feature_importance(importance_c, 'Top 10 Features From Random Forest for Predicting DEFAULT', top_n=10)
Extreme Gradient Boosting (XGBoost) is an ensemble learning algorithm based on gradient boosting decision trees. Unlike bagging-based methods such as Random Forest, XGBoost builds decision trees sequentially, where each new tree aims to correct the prediction errors of the previous ensemble. This boosting strategy enables high predictive accuracy, particularly in complex and imbalanced datasets.
For binary classification, XGBoost estimates the probability of the positive class as an additive combination of decision trees:
$$ \hat{p}(y=1 \mid X) = \sigma\left(\sum_{m=1}^{M} f_m(X)\right), $$
where $\sigma(\cdot)$ denotes the logistic (sigmoid) function.
The final class prediction is obtained using a fixed decision threshold:
$$ \hat{y} = \mathbb{I}\bigl(\hat{p}(y=1 \mid X) \ge \tau\bigr). $$
Here, $f_m(X)$ represents the contribution of the $m^{\text{th}}$ decision tree, $M$ is the total number of trees, and $X$ is the input feature vector.
XGBoost optimizes a regularized objective function that combines a differentiable loss function with a complexity penalty on the trees:
$$ \mathcal{L} = \sum_{i=1}^{N} \ell(y_i, \hat{y}_i) + \sum_{m=1}^{M} \Omega(f_m), $$
with
$$ \Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2, $$
where $\ell$ denotes the logistic loss, $T$ is the number of leaves, $w_j$ are the leaf weights, and $\gamma$ and $\lambda$ are regularization parameters.
XGBoost further incorporates shrinkage and structural constraints (e.g., learning rate and maximum tree depth) to control model complexity and reduce overfitting, making it well-suited for credit risk classification tasks involving non-linear feature interactions.
# Train XGBoost Classification with optimal parameters
xgb_c = XGBClassifier(
n_estimators = 100
, max_depth = 6
, learning_rate = 0.1
, random_state = 42
, eval_metric = 'logloss'
)
# Fit the model
xgb_c.fit(X_train_c.values, y_train_c.values)
# Make predictions
y_pred_xgb = xgb_c.predict(X_test_c.values)
y_pred_proba_xgb = xgb_c.predict_proba(X_test_c.values)[:, 1]
print("XGBoost Classification Model trained successfully!")
XGBoost Classification Model trained successfully!
# Calculate metrics
accuracy_xgb = accuracy_score(y_test_c, y_pred_xgb)
precision_xgb = precision_score(y_test_c, y_pred_xgb)
recall_xgb = recall_score(y_test_c, y_pred_xgb)
f1_xgb = f1_score(y_test_c, y_pred_xgb)
roc_auc_xgb = roc_auc_score(y_test_c, y_pred_proba_xgb)
print("=== XGBoost Classification Metrics ===")
print(f"Accuracy: {accuracy_xgb:.4f}")
print(f"Precision: {precision_xgb:.4f}")
print(f"Recall: {recall_xgb:.4f}")
print(f"F1-Score: {f1_xgb:.4f}")
print(f"ROC-AUC: {roc_auc_xgb:.4f}")
print("\n=== Classification Report ===")
print(classification_report(y_test_c, y_pred_xgb))
print("\n=== Confusion Matrix ===")
print(confusion_matrix(y_test_c, y_pred_xgb))
=== XGBoost Classification Metrics ===
Accuracy: 0.7033
Precision: 0.4500
Recall: 0.2118
F1-Score: 0.2880
ROC-AUC: 0.6070
=== Classification Report ===
precision recall f1-score support
0 0.74 0.90 0.81 215
1 0.45 0.21 0.29 85
accuracy 0.70 300
macro avg 0.60 0.55 0.55 300
weighted avg 0.66 0.70 0.66 300
=== Confusion Matrix ===
[[193 22]
[ 67 18]]
# XGBoost Feature Importance
importance_xgb_c = pd.Series(xgb_c.feature_importances_, index=X_train_c.columns)
# Plot feature importance
plot_feature_importance(importance_xgb_c, 'Top 10 Features From XGBoost for Predicting DEFAULT', top_n=10)
fpr_xgb, tpr_xgb, thresholds_xgb = roc_curve(y_test_c, y_pred_proba_xgb)
roc_auc_xgb_curve = auc(fpr_xgb, tpr_xgb)
# Plot ROC Curve
plt.figure(figsize=(10, 6))
plt.plot(fpr_xgb, tpr_xgb, color='darkorange', lw=2,
label=f'XGBoost ROC curve (AUC = {roc_auc_xgb_curve:.4f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Classifier')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - XGBoost Classification')
plt.legend(loc="lower right")
plt.grid(alpha=0.3)
plt.show()
The performance of the three classification models (Logistic Regression, Random Forest, and XGBoost) was evaluated on the test dataset using Accuracy, Precision, Recall, F1-score, and ROC-AUC, with Precision, Recall, and F1 computed for the default class (Class 1). The results are summarized in the Classification Model Comparison table below.
# -------- Logistic Regression metrics --------
y_pred_log = log_reg.predict(X_test_c)
y_pred_proba_log = log_reg.predict_proba(X_test_c)[:, 1]
accuracy_log = accuracy_score(y_test_c, y_pred_log)
precision_log = precision_score(y_test_c, y_pred_log)
recall_log = recall_score(y_test_c, y_pred_log)
f1_log = f1_score(y_test_c, y_pred_log)
roc_auc_log = roc_auc_score(y_test_c, y_pred_proba_log)
# ROC curve for Logistic Regression
fpr_log, tpr_log, _ = roc_curve(y_test_c, y_pred_proba_log)
# -------- Random Forest metrics --------
y_pred_rf = rf_c.predict(X_test_c)
y_pred_proba_rf = rf_c.predict_proba(X_test_c)[:, 1]
accuracy_rf = accuracy_score(y_test_c, y_pred_rf)
precision_rf = precision_score(y_test_c, y_pred_rf)
recall_rf = recall_score(y_test_c, y_pred_rf)
f1_rf = f1_score(y_test_c, y_pred_rf)
roc_auc_rf = roc_auc_score(y_test_c, y_pred_proba_rf)
# ROC curve for Random Forest
fpr_rf, tpr_rf, _ = roc_curve(y_test_c, y_pred_proba_rf)
# ROC curve for XGBoost
fpr_xgb, tpr_xgb, _ = roc_curve(y_test_c, y_pred_proba_xgb)
# -------- Create comparison dataframe --------
comparison_c = pd.DataFrame({
'Model' : ['Logistic Regression', 'Random Forest', 'XGBoost'],
'Accuracy' : [accuracy_log, accuracy_rf, accuracy_xgb],
'Precision' : [precision_log, precision_rf, precision_xgb],
'Recall' : [recall_log, recall_rf, recall_xgb],
'F1-Score' : [f1_log, f1_rf, f1_xgb],
'ROC-AUC' : [roc_auc_log, roc_auc_rf, roc_auc_xgb]
})
print("=== Classification Model Comparison ===")
print(comparison_c.to_string(index=False))
fig, axes = plt.subplots(1, 2, figsize=(18, 6))
def add_bar_labels(ax):
for container in ax.containers:
ax.bar_label(container, fmt='%.3f', label_type='edge', padding=10)
# --- Bar Plot: Metrics Comparison ---
comparison_c.set_index('Model').plot(
kind='bar',
ax=axes[0],
rot=0
)
axes[0].set_title('Classification Metrics Comparison', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Score')
axes[0].set_ylim([0, 1.0])
axes[0].legend(loc='lower right')
axes[0].grid(axis='y', alpha=0.3)
add_bar_labels(axes[0])
# --- ROC Curves Comparison ---
axes[1].plot(fpr_log, tpr_log, lw=2,
label=f'Logistic Regression (AUC = {roc_auc_log:.4f})')
axes[1].plot(fpr_rf, tpr_rf, lw=2,
label=f'Random Forest (AUC = {roc_auc_rf:.4f})')
axes[1].plot(fpr_xgb, tpr_xgb, lw=2,
label=f'XGBoost (AUC = {roc_auc_xgb:.4f})')
axes[1].plot([0, 1], [0, 1], 'k--', lw=1.5, label='Random Classifier')
axes[1].set_xlim([0.0, 1.0])
axes[1].set_ylim([0.0, 1.05])
axes[1].set_xlabel('False Positive Rate')
axes[1].set_ylabel('True Positive Rate')
axes[1].set_title('ROC Curves Comparison', fontsize=14, fontweight='bold')
axes[1].legend(loc="lower right")
axes[1].grid(alpha=0.3)
plt.tight_layout()
plt.show()
=== Classification Model Comparison ===
Model Accuracy Precision Recall F1-Score ROC-AUC
Logistic Regression 0.673333 0.438095 0.541176 0.484211 0.630588
Random Forest 0.706667 0.421053 0.094118 0.153846 0.614802
XGBoost 0.703333 0.450000 0.211765 0.288000 0.607004
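As a side note, the per-model metric blocks above repeat the same pattern for each classifier; they can be factored into a small helper. This is a sketch assuming any fitted scikit-learn-style classifier that exposes `predict` and `predict_proba` (the synthetic data below is only for demonstration — in the notebook one would pass `log_reg`, `X_test_c`, `y_test_c`):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def classification_metrics(model, X, y):
    """Compute standard binary-classification metrics for a fitted model."""
    y_pred = model.predict(X)
    y_proba = model.predict_proba(X)[:, 1]
    return {
        'Accuracy':  accuracy_score(y, y_pred),
        'Precision': precision_score(y, y_pred, zero_division=0),
        'Recall':    recall_score(y, y_pred),
        'F1-Score':  f1_score(y, y_pred),
        'ROC-AUC':   roc_auc_score(y, y_proba),
    }

# Demo on synthetic data
X_demo, y_demo = make_classification(n_samples=300, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
metrics = classification_metrics(clf, X_demo, y_demo)
```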
Logistic Regression demonstrates the highest recall (0.541) and F1-score (0.484) among the three models, indicating a better balance between identifying default cases and maintaining classification precision. This suggests that the baseline model is effective at capturing the underlying linear relationships in the dataset and provides stable performance despite its simplicity.
Random Forest achieves the highest overall accuracy (0.707). However, it exhibits a very low recall (0.094), indicating poor detection of default cases. This behavior suggests that the model is biased toward the majority class, which limits its usefulness in credit risk classification where identifying defaulters is critical.
XGBoost attains comparable accuracy (0.703) and the highest precision (0.450), reflecting stronger performance in avoiding false default alarms. Nevertheless, its recall (0.212) remains substantially lower than that of Logistic Regression, indicating that many default instances are still missed. While XGBoost captures complex non-linear patterns, its conservative predictions reduce its effectiveness in detecting high-risk borrowers.
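One common remedy for the low recall of the tree ensembles is to lower the decision threshold below the default 0.5, trading precision for recall. A minimal sketch on synthetic imbalanced data (the class ratio roughly mirrors the ~28% default rate in the test set; the 0.3 threshold is illustrative, not a tuned value):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the credit dataset (~28% positives)
X, y = make_classification(n_samples=1000, weights=[0.72, 0.28], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
proba = rf.predict_proba(X_te)[:, 1]

# Lowering the threshold can only grow the set of predicted positives,
# so recall is non-decreasing (precision usually drops in exchange)
recall_at = {thr: recall_score(y_te, (proba >= thr).astype(int))
             for thr in (0.5, 0.3)}
```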
Why transform CREDIT_SCORE?
Although tree-based models do not require normality, linear regression performance is often improved when the target (and consequently residuals) are closer to a symmetric distribution. Transformation can stabilize variance and reduce the impact of skewness, which helps satisfy the underlying assumptions of linear regression.
We applied the Yeo–Johnson transformation because it is more flexible than Box–Cox: it supports both positive and non-positive values. The transformation is defined as:
$$ y^{(\lambda)} = \begin{cases} \dfrac{(y + 1)^\lambda - 1}{\lambda}, & \text{if } y \ge 0, \ \lambda \ne 0,\\[6pt] \log(y+1), & \text{if } y \ge 0, \ \lambda = 0,\\[6pt] -\dfrac{(-y + 1)^{2-\lambda}-1}{2-\lambda}, & \text{if } y < 0, \ \lambda \ne 2,\\[6pt] -\log(-y+1), & \text{if } y < 0, \ \lambda = 2. \end{cases} $$

In our experiment, the original CREDIT_SCORE has mean $= 588.36$ and standard deviation $= 58.09$; with standardization enabled, the transformed target has mean $0$ and standard deviation $1$.
pt = PowerTransformer(method='yeo-johnson', standardize=True)
# Fit and transform CREDIT_SCORE
y_regression_transformed = pt.fit_transform(y_regression.values.reshape(-1, 1)).ravel()
print(f"Original CREDIT_SCORE - Mean: {y_regression.mean():.2f}, Std: {y_regression.std():.2f}")
print(f"Transformed CREDIT_SCORE - Mean: {y_regression_transformed.mean():.2f}, Std: {y_regression_transformed.std():.2f}")
Original CREDIT_SCORE - Mean: 588.36, Std: 58.09
Transformed CREDIT_SCORE - Mean: 0.00, Std: 1.00
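Because the model is then trained on the transformed scale, predictions must be mapped back to credit-score units for reporting. `PowerTransformer` provides `inverse_transform` for this; a quick round-trip check on synthetic scores (mean and spread chosen to mimic CREDIT_SCORE):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
scores = rng.normal(588, 58, size=(200, 1))   # stand-in for CREDIT_SCORE

pt = PowerTransformer(method='yeo-johnson', standardize=True)
z = pt.fit_transform(scores)                  # standardized: mean ~0, std ~1
recovered = pt.inverse_transform(z)           # back to the original scale
```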
As a baseline, we used Ordinary Least Squares (OLS) Linear Regression due to its interpretability and strong theoretical foundation.
The goal is to estimate a linear relationship between predictors and credit score, providing a transparent benchmark that is easy to explain.
The model assumes:
$$ y_i = \beta_0 + \mathbf{X}_i^\top \boldsymbol{\beta} + \varepsilon_i, \quad \text{with } \mathbb{E}[\varepsilon_i] = 0 $$

The parameters are found by minimizing the sum of squared residuals:

$$ \min_{\beta_0,\boldsymbol{\beta}} \sum_{i=1}^{n} \left( y_i - \beta_0 - \mathbf{X}_i^\top \boldsymbol{\beta} \right)^2 $$

Linear regression provides two advantages in this context: its coefficients are directly interpretable as marginal effects on the credit score, and it gives a transparent benchmark against which the ensemble models can be judged.
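This least-squares problem has the closed-form solution $\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top \mathbf{y}$ (with an intercept column appended to $\mathbf{X}$). As a sanity check, solving it directly with NumPy reproduces scikit-learn's `LinearRegression` on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=100)

# Closed form: append an intercept column and solve the least-squares problem
X1 = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)

model = LinearRegression().fit(X, y)
```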
# 1. Split Data for Regression
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X_scaled_df, y_regression, test_size=0.3, random_state=42)
# 2. Implement Baseline Model ---Linear Regression---
lin_reg = LinearRegression()
lin_reg.fit(X_train_r, y_train_r)
# 3. Evaluate Model
y_pred_r = lin_reg.predict(X_test_r)
# Calculate Evaluation Metrics
rmse = np.sqrt(mean_squared_error(y_test_r, y_pred_r))
r2 = r2_score(y_test_r, y_pred_r)
mean_abs_err = np.mean(np.abs(y_test_r - y_pred_r))
print("\n--- Regression Results (Linear Regression) ---")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"R-squared (R^2): {r2:.4f}")
print(f"Mean Absolute Error (MAE): {mean_abs_err:.2f}")
--- Regression Results (Linear Regression) ---
Root Mean Squared Error (RMSE): 29.90
R-squared (R^2): 0.7432
Mean Absolute Error (MAE): 21.80
These results show that a significant portion of credit score variability is already captured by linear combinations of standardized financial ratios and spending indicators.
Random Forest Regression is an ensemble method designed to capture non-linear patterns by combining many decision trees.
Each tree is trained on a bootstrap sample of the training set, and at each split only a random subset of features is considered. This injected randomness reduces correlation between trees and improves generalization.
The prediction is obtained by averaging the outputs of $M$ trees:
$$ \hat{y}(X) = \frac{1}{M}\sum_{m=1}^M f_m(X) $$

# Train Random Forest regressor and extract feature importance
rf_r = RandomForestRegressor(n_estimators=100, random_state=42)
rf_r.fit(X_train_r, y_train_r)
# Get feature importance scores
importance_r = pd.Series(rf_r.feature_importances_, index=X_train_r.columns)
# Evaluate Model
y_pred_regrf = rf_r.predict(X_test_r)
# Calculate Evaluation Metrics
rmse_rf = np.sqrt(mean_squared_error(y_test_r, y_pred_regrf))
r2_rf = r2_score(y_test_r, y_pred_regrf)
mae_rf = np.mean(np.abs(y_test_r - y_pred_regrf))
print("--- Regression Result after Applying Random Forest ---")
print(f"Root Mean Squared Error (RMSE): {rmse_rf:.2f}")
print(f"R-squared (R^2): {r2_rf:.4f}")
print(f"Mean Absolute Error (MAE): {mae_rf:.2f}")
--- Regression Result after Applying Random Forest ---
Root Mean Squared Error (RMSE): 28.07
R-squared (R^2): 0.7736
Mean Absolute Error (MAE): 20.59
Random Forest achieved the best regression performance, improving both error metrics and explained variance compared to the linear baseline.
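The averaging formula above is literal: a fitted `RandomForestRegressor` exposes its trees via `estimators_`, and the ensemble prediction equals the mean of the per-tree predictions. A small check on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)
rf = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

# Average the individual trees' predictions by hand
per_tree = np.stack([tree.predict(X) for tree in rf.estimators_])
manual_avg = per_tree.mean(axis=0)
```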
# Plot the top 10 most important features
plot_feature_importance(importance_r, 'Top 10 Features for Predicting CREDIT_SCORE', top_n=10)
The feature importance plot shows that R_DEBT_INCOME dominates prediction. This is consistent with financial intuition: clients with high debt relative to income tend to have weaker repayment capacity and thus lower creditworthiness.
XGBoost (Extreme Gradient Boosting) is a boosting-based ensemble method that builds decision trees sequentially.
Unlike Random Forest (bagging), boosting focuses on correcting prediction errors iteratively: each new tree is trained to reduce the residuals of the previous model.
The model prediction is:
$$ \hat{y}(X) = \sum_{m=1}^{M} f_m(X) $$

XGBoost minimizes a regularized objective function that balances fit and complexity:

$$ \mathcal{L} = \sum_{i=1}^{n} \ell(y_i, \hat{y}_i) + \sum_{m=1}^{M} \Omega(f_m), \quad \Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 $$

where $T$ is the number of leaves and $w_j$ are the leaf weights.
This regularization is important in credit modeling because it reduces overfitting and encourages simpler tree structures.
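The core boosting step (each new tree fits the residuals of the current prediction) can be sketched in a few lines for squared-error loss. This is a bare-bones gradient boosting loop with shrinkage $\nu = 0.1$, not XGBoost's regularized algorithm, but it shows why the training error shrinks round by round:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=4, noise=5, random_state=0)
nu = 0.1                                  # learning rate (shrinkage)
pred = np.full_like(y, y.mean())          # start from the mean prediction
mse = [np.mean((y - pred) ** 2)]

for _ in range(50):                       # each round fits a tree to residuals
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X, y - pred)                 # residuals = negative gradient of MSE
    pred = pred + nu * tree.predict(X)
    mse.append(np.mean((y - pred) ** 2))
```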
# Train XGBoost Regression with optimal parameters
xgb_r = XGBRegressor(
n_estimators = 100
, max_depth = 6
, learning_rate = 0.1
, random_state = 42
, objective = 'reg:squarederror'
)
# Fit the model
xgb_r.fit(X_train_r.values, y_train_r.values)
# Make predictions
y_pred_xgb_r = xgb_r.predict(X_test_r.values)
print("XGBoost Regression Model trained successfully!")
XGBoost Regression Model trained successfully!
# Calculate metrics
mae_xgb = mean_absolute_error(y_test_r, y_pred_xgb_r)
mse_xgb = mean_squared_error(y_test_r, y_pred_xgb_r)
rmse_xgb = np.sqrt(mse_xgb)
r2_xgb = r2_score(y_test_r, y_pred_xgb_r)
print("=== XGBoost Regression Metrics ===")
print(f"Mean Absolute Error (MAE): {mae_xgb:.4f}")
print(f"Mean Squared Error (MSE): {mse_xgb:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse_xgb:.4f}")
print(f"R² Score: {r2_xgb:.4f}")
=== XGBoost Regression Metrics ===
Mean Absolute Error (MAE): 21.6811
Mean Squared Error (MSE): 871.8902
Root Mean Squared Error (RMSE): 29.5278
R² Score: 0.7495
XGBoost performs similarly to the linear baseline but slightly worse than Random Forest. A likely explanation is that the dataset size (1000) and parameter choices (learning rate, depth) may limit the advantage of boosting, while Random Forest benefits more from variance reduction.
# XGBoost Feature Importance for Regression
importance_xgb_r = pd.Series(xgb_r.feature_importances_, index=X_train_r.columns)
# Plot feature importance
plot_feature_importance(importance_xgb_r, 'Top 10 Features From XGBoost for Predicting CREDIT_SCORE', top_n=10)
XGBoost also emphasizes R_DEBT_INCOME as the primary driver, followed by expenditure-related ratios, reinforcing that relative financial burden is key for credit score prediction.
To verify model behavior beyond metrics, we analyzed diagnostic plots: predicted vs actual values and residual distributions.
A good regression model should produce points close to the $y=x$ line (perfect prediction) and residuals should be centered around zero without strong structure.
# Visualize Predictions vs Actual for XGBoost
plt.figure(figsize=(10, 6))
plt.scatter(y_test_r, y_pred_xgb_r, alpha=0.6, edgecolors='k')
plt.plot([y_test_r.min(), y_test_r.max()], [y_test_r.min(), y_test_r.max()],
'r--', lw=2, label='Perfect Prediction')
plt.xlabel('Actual CREDIT_SCORE')
plt.ylabel('Predicted CREDIT_SCORE')
plt.title('XGBoost: Predicted vs Actual CREDIT_SCORE')
plt.legend()
plt.grid(alpha=0.3)
plt.show()
# Residual plot
residuals_xgb = y_test_r - y_pred_xgb_r
plt.figure(figsize=(10, 6))
plt.scatter(y_pred_xgb_r, residuals_xgb, alpha=0.6, edgecolors='k')
plt.axhline(y=0, color='r', linestyle='--', lw=2)
plt.xlabel('Predicted CREDIT_SCORE')
plt.ylabel('Residuals')
plt.title('XGBoost: Residual Plot')
plt.grid(alpha=0.3)
plt.show()
The plots indicate that predictions generally follow the perfect prediction line, but dispersion increases for extreme credit scores.
The residual plot is mostly centered around zero, but a few large residuals appear, showing that some clients have atypical profiles not well captured by the model (potential outliers or rare spending/debt patterns).
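Such atypical profiles can be flagged automatically, for example by marking clients whose absolute residual exceeds two standard deviations. A sketch on synthetic residuals (in the notebook one would pass `y_test_r - y_pred_xgb_r` instead):

```python
import numpy as np

rng = np.random.default_rng(42)
residuals = rng.normal(0, 25, size=300)   # stand-in for y_test_r - y_pred_xgb_r
residuals[:3] = [120, -110, 95]           # inject a few atypical clients

# Flag observations more than 2 standard deviations from zero
threshold = 2 * residuals.std()
outlier_idx = np.where(np.abs(residuals) > threshold)[0]
```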
We compared Random Forest and XGBoost using standard regression metrics:
# Get Random Forest regression metrics
y_pred_rf_r = rf_r.predict(X_test_r)
mae_rf = mean_absolute_error(y_test_r, y_pred_rf_r)
mse_rf = mean_squared_error(y_test_r, y_pred_rf_r)
rmse_rf = np.sqrt(mse_rf)
r2_rf = r2_score(y_test_r, y_pred_rf_r)
# Create comparison dataframe
comparison_r = pd.DataFrame({
'Model' : ['Random Forest', 'XGBoost']
, 'MAE' : [mae_rf, mae_xgb]
, 'MSE' : [mse_rf, mse_xgb]
, 'RMSE' : [rmse_rf, rmse_xgb]
, 'R² Score': [r2_rf, r2_xgb]
})
print("=== Regression Model Comparison ===")
print(comparison_r.to_string(index=False))
fig, axes = plt.subplots(1, 2, figsize=(16, 5))
error_metrics = comparison_r[['Model', 'MAE', 'RMSE']].set_index('Model')
error_metrics.plot(kind='bar', ax=axes[0], rot=0, color=['skyblue', 'lightcoral'])
axes[0].set_title('Regression Error Metrics Comparison (Lower is Better)', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Error Value')
axes[0].legend(loc='upper right')
axes[0].grid(axis='y', alpha=0.3)
r2_comparison = comparison_r[['Model', 'R² Score']].set_index('Model')
r2_comparison.plot(kind='bar', ax=axes[1], rot=0, color='lightgreen', legend=False)
axes[1].set_title('R² Score Comparison (Higher is Better)', fontsize=14, fontweight='bold')
axes[1].set_ylabel('R² Score')
axes[1].set_ylim([0, 1.0])
axes[1].grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
best_model_r2 = comparison_r.loc[comparison_r['R² Score'].idxmax(), 'Model']
best_model_mae = comparison_r.loc[comparison_r['MAE'].idxmin(), 'Model']
print(f"\nBest model by R² Score: {best_model_r2}")
print(f"Best model by MAE: {best_model_mae}")
=== Regression Model Comparison ===
Model MAE MSE RMSE R² Score
Random Forest 20.585950 788.099835 28.073116 0.773618
XGBoost 21.681101 871.890225 29.527787 0.749549
Best model by R² Score: Random Forest
Best model by MAE: Random Forest
Interpretation: Random Forest outperforms XGBoost across all metrics. This indicates that averaging many trees is particularly effective for this dataset, likely because it reduces variance and captures non-linearities without requiring careful boosting hyperparameter tuning.
To reduce feature dimensionality and mitigate multicollinearity, Principal Component Analysis (PCA) was applied prior to classification.
The models were trained and evaluated using the first 18 principal components derived from the dataset.
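The number of retained components is typically chosen from the cumulative explained-variance curve, e.g. the smallest $k$ explaining at least 90% of the variance. A sketch on synthetic correlated features (the 90% cutoff and the data are illustrative; the notebook's choice of 18 components may have used a different criterion):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Correlated synthetic features: 6 latent factors mixed into 30 observed columns
latent = rng.normal(size=(1000, 6))
X = latent @ rng.normal(size=(6, 30)) + 0.1 * rng.normal(size=(1000, 30))

X_std = StandardScaler().fit_transform(X)
pca = PCA().fit(X_std)
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cum_var >= 0.90) + 1)  # smallest k explaining >= 90%
```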
# Call PCA dataframe that we have created above to work.
df_pca.head()
| | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | ... | PC13 | PC14 | PC15 | PC16 | PC17 | PC18 | DEFAULT | CREDIT_SCORE | CATEGORY | Cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.263568 | 1.675703 | 1.616015 | -1.777544 | 0.900540 | 0.361975 | -2.571286 | -0.326115 | 0.336299 | 0.484445 | ... | -0.392118 | -0.575822 | 0.026258 | 1.003179 | -0.838064 | 0.592423 | 1 | 444.0 | Poor | 2 |
| 1 | -1.113713 | 0.097301 | -0.083219 | -1.488284 | -0.278663 | -1.442984 | 1.640455 | -2.546311 | -2.071634 | 1.182105 | ... | 0.011836 | -1.387963 | -1.021540 | 0.263893 | -0.475882 | 0.268768 | 0 | 625.0 | Good | 2 |
| 2 | 0.244502 | 6.755585 | -0.368538 | 1.002296 | 4.414212 | 1.369341 | -3.355328 | 0.337632 | -1.563512 | 0.385298 | ... | -0.693706 | -0.040053 | 0.396325 | 0.708798 | 0.652537 | 1.179665 | 1 | 469.0 | Poor | 5 |
| 3 | 2.228464 | 5.653206 | 2.395674 | 0.698185 | -1.647916 | 0.834930 | 0.376260 | -0.190960 | 0.288253 | 1.457348 | ... | 0.668575 | -0.342176 | 0.983585 | -1.665101 | -0.285464 | 0.304583 | 0 | 559.0 | Fair | 3 |
| 4 | 1.300493 | -2.019472 | -2.782794 | 0.626834 | -0.635844 | -2.531894 | -2.460998 | -1.658941 | 0.163559 | 0.804533 | ... | 0.124539 | 0.018004 | 0.481896 | -0.458401 | -0.139912 | -0.680096 | 0 | 473.0 | Poor | 0 |
5 rows × 22 columns
# Features (X_pca): use the first 18 principal components, PC1 through PC18
X_pca = df_pca[[f'PC{i}' for i in range(1, 19)]]
# Targets
y_classification = df_pca['DEFAULT']
y_regression = df_pca['CREDIT_SCORE']
print(f"PCA Feature set shape: {X_pca.shape}")
PCA Feature set shape: (1000, 18)
X_train_pca_c, X_test_pca_c, y_train_c, y_test_c = train_test_split(X_pca, y_classification,
test_size=0.3, random_state=42, stratify=y_classification)
log_reg_pca = LogisticRegression(class_weight='balanced', random_state=42, solver='liblinear')
log_reg_pca.fit(X_train_pca_c, y_train_c)
# Predict probabilities
y_pred_proba_pca_c = log_reg_pca.predict_proba(X_test_pca_c)[:, 1]
auc_roc_pca = roc_auc_score(y_test_c, y_pred_proba_pca_c)
print("\n--- CLASSIFICATION Results (Logistic Regression on PCA Features) ---")
print(f"AUC-ROC Score on Test Set (PCA): {auc_roc_pca:.4f}")
# Classification Report
y_pred_pca_c = log_reg_pca.predict(X_test_pca_c)
print("\nClassification Report (PCA):")
print(classification_report(y_test_c, y_pred_pca_c))
--- CLASSIFICATION Results (Logistic Regression on PCA Features) ---
AUC-ROC Score on Test Set (PCA): 0.6546
Classification Report (PCA):
precision recall f1-score support
0 0.78 0.64 0.71 215
1 0.38 0.55 0.45 85
accuracy 0.62 300
macro avg 0.58 0.60 0.58 300
weighted avg 0.67 0.62 0.63 300
where $P(x) = \mathbb{E}[Y \mid X = x] = \Pr(Y = 1 \mid X = x)$ denotes the probability of default given the feature vector $x$.
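Concretely, scikit-learn's `predict_proba` for binary logistic regression is the sigmoid applied to the linear score $\beta_0 + x^\top \boldsymbol{\beta}$, which can be verified by hand on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# P(Y=1|x) = sigmoid(beta_0 + x.beta): recompute predict_proba manually
logits = X @ clf.coef_[0] + clf.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-logits))
```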
# The coefficients show the impact of each PC on the log-odds of defaulting
coefs_c = pd.Series(log_reg_pca.coef_[0], index=X_pca.columns).sort_values(ascending=False)
print("\nPCA Factor Coefficients for Default Risk (sorted):")
print(coefs_c)
PCA Factor Coefficients for Default Risk (sorted):
PC12    0.263729
PC18    0.096250
PC2     0.086298
PC17    0.078865
PC16    0.075884
PC1     0.069205
PC8     0.032825
PC5     0.015162
PC3     0.000561
PC10   -0.005295
PC15   -0.013606
PC13   -0.055558
PC9    -0.079374
PC11   -0.086603
PC14   -0.091244
PC4    -0.104644
PC6    -0.118635
PC7    -0.340028
dtype: float64
plt.figure(figsize=(10, 6))
coefs_c_sorted = coefs_c.sort_values(ascending=True)
sns.barplot(x=coefs_c_sorted.values, y=coefs_c_sorted.index, palette='icefire')
plt.title('Impact of PCA Factors on Default Risk ')
plt.xlabel('Coefficient Value (Impact on Log-Odds of Default)')
plt.ylabel('Principal Component Factor')
plt.axvline(0, color='red', linestyle='--', linewidth=2) # Baseline at zero
plt.text(coefs_c_sorted.max() * 0.95, -0.5, '↑ Increases Default Risk',
fontsize=10, color='green', ha='right')
plt.text(coefs_c_sorted.min() * 0.95, -0.5, '↓ Decreases Default Risk',
fontsize=10, color='blue', ha='left')
plt.tight_layout()
plt.show()
Interpretation Classification Plot (Default Risk)
Positive coefficients: these PCs are positively correlated with the log-odds of defaulting. From the coefficient table, PC12, PC18, and PC2 carry the largest positive weights, meaning the latent factors they capture increase the likelihood that a customer will default.
Negative coefficients: these PCs are negatively correlated with default risk. PC7, PC6, and PC4 carry the largest negative weights; they act as "protective" factors, substantially decreasing the likelihood that a customer will default.
# Train XGBoost Classification on PCA data with optimal parameters
xgb_pca_c = XGBClassifier(
n_estimators = 100
, max_depth = 6
, learning_rate = 0.1
, random_state = 42
, eval_metric = 'logloss'
)
# Fit the model
xgb_pca_c.fit(X_train_pca_c.values, y_train_c.values)
# Make predictions
y_pred_xgb_pca_c = xgb_pca_c.predict(X_test_pca_c.values)
y_pred_proba_xgb_pca_c = xgb_pca_c.predict_proba(X_test_pca_c.values)[:, 1]
print("XGBoost Classification on PCA data trained successfully!")
XGBoost Classification on PCA data trained successfully!
# Classification Metrics for XGBoost on PCA
accuracy_xgb_pca_c = accuracy_score(y_test_c, y_pred_xgb_pca_c)
precision_xgb_pca_c = precision_score(y_test_c, y_pred_xgb_pca_c)
recall_xgb_pca_c = recall_score(y_test_c, y_pred_xgb_pca_c)
f1_xgb_pca_c = f1_score(y_test_c, y_pred_xgb_pca_c)
roc_auc_xgb_pca_c = roc_auc_score(y_test_c, y_pred_proba_xgb_pca_c)
print("=== XGBoost Classification Metrics (PCA Data) ===")
print(f"Accuracy: {accuracy_xgb_pca_c:.4f}")
print(f"Precision: {precision_xgb_pca_c:.4f}")
print(f"Recall: {recall_xgb_pca_c:.4f}")
print(f"F1-Score: {f1_xgb_pca_c:.4f}")
print(f"ROC-AUC: {roc_auc_xgb_pca_c:.4f}")
print("\n=== Classification Report ===")
print(classification_report(y_test_c, y_pred_xgb_pca_c))
print("\n=== Confusion Matrix ===")
print(confusion_matrix(y_test_c, y_pred_xgb_pca_c))
=== XGBoost Classification Metrics (PCA Data) ===
Accuracy: 0.7100
Precision: 0.4762
Recall: 0.2353
F1-Score: 0.3150
ROC-AUC: 0.6161
=== Classification Report ===
precision recall f1-score support
0 0.75 0.90 0.82 215
1 0.48 0.24 0.31 85
accuracy 0.71 300
macro avg 0.61 0.57 0.57 300
weighted avg 0.67 0.71 0.67 300
=== Confusion Matrix ===
[[193 22]
[ 65 20]]
# Feature Importance for PCA components
importance_xgb_pca_c = pd.Series(xgb_pca_c.feature_importances_, index=X_train_pca_c.columns)
plt.figure(figsize = (10, 6))
importance_sorted = importance_xgb_pca_c.sort_values(ascending=True)
sns.barplot(x=importance_sorted.values, y=importance_sorted.index, palette='viridis')
plt.title('XGBoost Feature Importance on PCA Components (Classification)')
plt.xlabel('Feature Importance Score')
plt.ylabel('PCA Components')
plt.show()
# ROC Curve for XGBoost on PCA
fpr_xgb_pca_c, tpr_xgb_pca_c, _ = roc_curve(y_test_c, y_pred_proba_xgb_pca_c)
roc_auc_xgb_pca_c_curve = auc(fpr_xgb_pca_c, tpr_xgb_pca_c)
plt.figure(figsize=(10, 6))
plt.plot(fpr_xgb_pca_c, tpr_xgb_pca_c, color='darkorange', lw=2,
label=f'XGBoost on PCA ROC curve (AUC = {roc_auc_xgb_pca_c_curve:.4f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Classifier')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - XGBoost Classification on PCA Data')
plt.legend(loc="lower right")
plt.grid(alpha=0.3)
plt.show()
# Get Logistic Regression metrics on PCA
y_pred_logreg_pca = log_reg_pca.predict(X_test_pca_c)
y_pred_proba_logreg_pca = log_reg_pca.predict_proba(X_test_pca_c)[:, 1]
accuracy_logreg_pca = accuracy_score(y_test_c, y_pred_logreg_pca)
precision_logreg_pca = precision_score(y_test_c, y_pred_logreg_pca)
recall_logreg_pca = recall_score(y_test_c, y_pred_logreg_pca)
f1_logreg_pca = f1_score(y_test_c, y_pred_logreg_pca)
roc_auc_logreg_pca = roc_auc_score(y_test_c, y_pred_proba_logreg_pca)
# Create comparison dataframe
comparison_pca_c = pd.DataFrame({
'Model' : ['Logistic Regression', 'XGBoost']
, 'Accuracy' : [accuracy_logreg_pca, accuracy_xgb_pca_c]
, 'Precision' : [precision_logreg_pca, precision_xgb_pca_c]
, 'Recall' : [recall_logreg_pca, recall_xgb_pca_c]
, 'F1-Score' : [f1_logreg_pca, f1_xgb_pca_c]
, 'ROC-AUC' : [roc_auc_logreg_pca, roc_auc_xgb_pca_c]
})
print("=== PCA Classification Model Comparison ===")
print(comparison_pca_c.to_string(index=False))
# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(16, 5))
# Bar plot comparison
comparison_pca_c.set_index('Model').plot(kind='bar', ax=axes[0], rot=0)
axes[0].set_title('Classification Metrics Comparison on PCA Data', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Score')
axes[0].set_ylim([0, 1.0])
axes[0].legend(loc='lower right')
axes[0].grid(axis='y', alpha=0.3)
# ROC Curves comparison
fpr_logreg_pca, tpr_logreg_pca, _ = roc_curve(y_test_c, y_pred_proba_logreg_pca)
roc_auc_logreg_pca_curve = auc(fpr_logreg_pca, tpr_logreg_pca)
axes[1].plot(fpr_logreg_pca, tpr_logreg_pca, color='blue', lw=2,
label=f'Logistic Regression (AUC = {roc_auc_logreg_pca_curve:.4f})')
axes[1].plot(fpr_xgb_pca_c, tpr_xgb_pca_c, color='darkorange', lw=2,
label=f'XGBoost (AUC = {roc_auc_xgb_pca_c_curve:.4f})')
axes[1].plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Classifier')
axes[1].set_xlim([0.0, 1.0])
axes[1].set_ylim([0.0, 1.05])
axes[1].set_xlabel('False Positive Rate')
axes[1].set_ylabel('True Positive Rate')
axes[1].set_title('ROC Curves Comparison on PCA Data', fontsize=14, fontweight='bold')
axes[1].legend(loc="lower right")
axes[1].grid(alpha=0.3)
plt.tight_layout()
plt.show()
=== PCA Classification Model Comparison ===
Model Accuracy Precision Recall F1-Score ROC-AUC
Logistic Regression 0.616667 0.379032 0.552941 0.449761 0.654555
XGBoost 0.710000 0.476190 0.235294 0.314961 0.616142
Interpretation from PCA Classification Model Comparison Table
Logistic Regression achieves higher recall (0.553) and F1-score (0.450), indicating it is better at identifying default cases, despite having lower overall accuracy (0.617) compared to XGBoost. XGBoost shows higher accuracy (0.710) and precision (0.476) but misses many defaults, as reflected by its low recall (0.235) and F1-score (0.315).
Overall, for credit risk assessment on PCA-transformed data, Logistic Regression is preferable due to its better balance between detecting defaults and maintaining interpretability, whereas XGBoost prioritizes correctly classifying non-default cases.
# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
# -------------------------------
# Replace axes[0] with PCA coefficient impact plot
# -------------------------------
coefs_c_sorted = coefs_c.sort_values(ascending=True)
sns.barplot(x=coefs_c_sorted.values, y=coefs_c_sorted.index, palette='icefire', ax=axes[0])
axes[0].set_title('Impact of PCA Factors on Default Risk', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Coefficient Value (Impact on Log-Odds of Default)')
axes[0].set_ylabel('Principal Component Factor')
axes[0].axvline(0, color='red', linestyle='--', linewidth=2) # Baseline at zero
# Middle y-position
y_middle = len(coefs_c_sorted) / 2
axes[0].text(coefs_c_sorted.max() * 0.95, y_middle, '↑ Increases Default Risk',
fontsize=10, color='green', ha='right', va='center')
axes[0].text(coefs_c_sorted.min() * 0.95, y_middle, '↓ Decreases Default Risk',
fontsize=10, color='blue', ha='left', va='center')
# -------------------------------
# ROC Curves comparison
# -------------------------------
fpr_logreg_pca, tpr_logreg_pca, _ = roc_curve(y_test_c, y_pred_proba_logreg_pca)
roc_auc_logreg_pca_curve = auc(fpr_logreg_pca, tpr_logreg_pca)
axes[1].plot(fpr_logreg_pca, tpr_logreg_pca, color='blue', lw=2,
label=f'Logistic Regression (AUC = {roc_auc_logreg_pca_curve:.4f})')
axes[1].plot(fpr_xgb_pca_c, tpr_xgb_pca_c, color='darkorange', lw=2,
label=f'XGBoost (AUC = {roc_auc_xgb_pca_c_curve:.4f})')
axes[1].plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Classifier')
axes[1].set_xlim([0.0, 1.0])
axes[1].set_ylim([0.0, 1.05])
axes[1].set_xlabel('False Positive Rate')
axes[1].set_ylabel('True Positive Rate')
axes[1].set_title('ROC Curves Comparison on PCA Data', fontsize=14, fontweight='bold')
axes[1].legend(loc="lower right")
axes[1].grid(alpha=0.3)
plt.tight_layout()
plt.show()
The relative importance of principal components in predicting default was assessed using the Logistic Regression coefficients on the PCA-transformed dataset.
High-risk PCA factors (positive coefficients): PC12, PC18, PC2, PC17, and PC16 carry the largest positive weights on the log-odds of default.
Risk-reducing PCA factors (negative coefficients): PC7, PC6, PC4, PC14, and PC11 carry the largest negative weights, acting as protective factors.
Compared to the original feature space, overall performance does not improve and several metrics decline. This suggests that compressing the information into 18 components discards some useful predictive signal, even though it reduces collinearity.
# 1. Split Data for Regression
X_train_pca_r, X_test_pca_r, y_train_r, y_test_r = train_test_split(
X_pca, y_regression, test_size=0.3, random_state=42)
# 2. Implement Baseline Model (Linear Regression)
lin_reg_pca = LinearRegression()
lin_reg_pca.fit(X_train_pca_r, y_train_r)
# 3. Evaluate Model
y_pred_pca_r = lin_reg_pca.predict(X_test_pca_r)
# Calculate Evaluation Metrics
rmse_pca = np.sqrt(mean_squared_error(y_test_r, y_pred_pca_r))
r2_pca = r2_score(y_test_r, y_pred_pca_r)
mae_pca = np.mean(np.abs(y_test_r - y_pred_pca_r))
print("\n--- REGRESSION Results (Linear Regression on PCA Features) ---")
print(f"Root Mean Squared Error (RMSE - PCA): {rmse_pca:.2f}")
print(f"R-squared (R^2 - PCA): {r2_pca:.4f}")
print(f"Mean Absolute Error (MAE - PCA): {mae_pca:.2f}")
--- REGRESSION Results (Linear Regression on PCA Features) ---
Root Mean Squared Error (RMSE - PCA): 36.87
R-squared (R^2 - PCA): 0.6096
Mean Absolute Error (MAE - PCA): 26.96
# The coefficients show the impact of each PC on the Credit Score value
coefs_r = pd.Series(lin_reg_pca.coef_, index=X_pca.columns).sort_values(ascending=False)
print("\nTop 5 PCA Factors for Credit Score:")
print(coefs_r.head())
Top 5 PCA Factors for Credit Score:
PC7     19.167418
PC6      7.219287
PC4      6.562823
PC14     5.504804
PC13     3.467785
dtype: float64
Since Linear Regression is linear in PCA space, each coefficient quantifies how a one-unit increase in a component affects the predicted credit score (holding others constant).
The most influential components were:
PC7 (19.17), PC6 (7.22), PC4 (6.56), PC14 (5.50), PC13 (3.47)
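Because PCA is itself a linear map, coefficients learned on the components can be projected back onto the original features: with loading matrix $W$ (`pca.components_`), $\boldsymbol{\beta}_x = W^\top \boldsymbol{\beta}_z$, with the intercept adjusted for the centering. A sketch verifying this identity on synthetic data (the same back-projection would let the PC coefficients above be re-expressed in terms of the original financial ratios):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.1, size=500)

pca = PCA(n_components=5).fit(X)
Z = pca.transform(X)                        # Z = (X - mean_) @ components_.T
reg = LinearRegression().fit(Z, y)

# Map PCA-space coefficients back to the original feature space
beta_x = pca.components_.T @ reg.coef_
intercept_x = reg.intercept_ - pca.mean_ @ beta_x
y_hat_direct = X @ beta_x + intercept_x     # same predictions, original space
```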
plt.figure(figsize=(10, 6))
coefs_r_sorted = coefs_r.sort_values(ascending=True)
sns.barplot(x =coefs_r_sorted.values, y=coefs_r_sorted.index, palette='plasma')
plt.title('Impact of PCA Factors on Credit Score')
plt.xlabel('Coefficient Value (Impact on Credit Score)')
plt.ylabel('Principal Component Factor')
plt.axvline(0, color='red', linestyle='--', linewidth=2) # Baseline at zero
plt.text(coefs_r_sorted.max() * 0.95, -0.5, '↑ Increases Credit Score',
fontsize=10, color='green', ha='right')
plt.text(coefs_r_sorted.min() * 0.95, -0.5, '↓ Decreases Credit Score',
fontsize=10, color='blue', ha='left')
plt.tight_layout()
plt.show()
Interpretation Regression Plot (Credit Score)
Positive coefficients: these PCs increase the predicted CREDIT_SCORE. The dominant effect of PC7 suggests that the latent behavior captured by this component corresponds to financially stable patterns.
Negative coefficients: these PCs are negatively correlated with CREDIT_SCORE and decrease the predicted score. They likely represent patterns associated with higher financial stress (e.g., debt pressure or a heavy essential-spending burden).
# Train XGBoost Regression on PCA with params
xgb_pca_r = XGBRegressor(
n_estimators = 100
, max_depth = 6
, learning_rate = 0.1
, random_state = 42
, objective = 'reg:squarederror'
)
# Fit the model
xgb_pca_r.fit(X_train_pca_r.values, y_train_r.values)
# Make predictions
y_pred_xgb_pca_r = xgb_pca_r.predict(X_test_pca_r.values)
print("XGBoost Regression on PCA data trained successfully!")
XGBoost Regression on PCA data trained successfully!
# Regression Metrics for XGBoost on PCA
mae_xgb_pca_r = mean_absolute_error(y_test_r, y_pred_xgb_pca_r)
mse_xgb_pca_r = mean_squared_error(y_test_r, y_pred_xgb_pca_r)
rmse_xgb_pca_r = np.sqrt(mse_xgb_pca_r)
r2_xgb_pca_r = r2_score(y_test_r, y_pred_xgb_pca_r)
print("=== XGBoost Regression Metrics (PCA Data) ===")
print(f"Mean Absolute Error (MAE): {mae_xgb_pca_r:.4f}")
print(f"Mean Squared Error (MSE): {mse_xgb_pca_r:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse_xgb_pca_r:.4f}")
print(f"R² Score: {r2_xgb_pca_r:.4f}")
=== XGBoost Regression Metrics (PCA Data) ===
Mean Absolute Error (MAE): 28.3259
Mean Squared Error (MSE): 1424.3450
Root Mean Squared Error (RMSE): 37.7405
R² Score: 0.5909
# Feature Importance for PCA components (Regression)
importance_xgb_pca_r = pd.Series(xgb_pca_r.feature_importances_, index=X_train_pca_r.columns)
plt.figure(figsize=(10, 6))
importance_sorted_r = importance_xgb_pca_r.sort_values(ascending=True)
sns.barplot(x=importance_sorted_r.values, y=importance_sorted_r.index, hue=importance_sorted_r.index, palette='rocket', legend=False)
plt.title('XGBoost Feature Importance on PCA Components (Regression)')
plt.xlabel('Feature Importance Score')
plt.ylabel('PCA Components')
plt.show()
XGBoost identifies PC7 as the most influential component, consistent with the linear coefficient analysis. This agreement strengthens confidence that PC7 captures a latent factor strongly related to creditworthiness.
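Gain-based importances from tree models can be cross-checked with permutation importance, which measures the drop in held-out score when a single feature is shuffled. A hedged sketch on synthetic data (a RandomForestRegressor stands in for the notebook's `xgb_pca_r`; column 0 plays the role of a dominant component like PC7):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
# Target depends mostly on column 0, weakly on column 1
y = 3.0 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.5, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)
model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature in turn and record the mean R² drop on held-out data
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=42)
print(result.importances_mean)
```

When both the model's internal importances and the permutation scores single out the same component, as with PC7 here, the ranking is less likely to be an artifact of one importance definition.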
# Visualize Predictions vs Actual for XGBoost on PCA
plt.figure(figsize=(10, 6))
plt.scatter(y_test_r, y_pred_xgb_pca_r, alpha=0.6, edgecolors='k')
plt.plot([y_test_r.min(), y_test_r.max()], [y_test_r.min(), y_test_r.max()],
'r--', lw=2, label='Perfect Prediction')
plt.xlabel('Actual CREDIT_SCORE')
plt.ylabel('Predicted CREDIT_SCORE')
plt.title('XGBoost on PCA: Predicted vs Actual CREDIT_SCORE')
plt.legend()
plt.grid(alpha=0.3)
plt.show()
# Residual plot
residuals_xgb_pca = y_test_r - y_pred_xgb_pca_r
plt.figure(figsize=(10, 6))
plt.scatter(y_pred_xgb_pca_r, residuals_xgb_pca, alpha=0.6, edgecolors='k')
plt.axhline(y=0, color='r', linestyle='--', lw=2)
plt.xlabel('Predicted CREDIT_SCORE')
plt.ylabel('Residuals')
plt.title('XGBoost on PCA: Residual Plot')
plt.grid(alpha=0.3)
plt.show()
Interpretation: Compared to the original-feature setting, predictions show larger dispersion and residuals, which matches the weaker $R^2$ on PCA data.
# Get Linear Regression metrics on PCA
y_pred_linreg_pca = lin_reg_pca.predict(X_test_pca_r)
mae_linreg_pca = mean_absolute_error(y_test_r, y_pred_linreg_pca)
mse_linreg_pca = mean_squared_error(y_test_r, y_pred_linreg_pca)
rmse_linreg_pca = np.sqrt(mse_linreg_pca)
r2_linreg_pca = r2_score(y_test_r, y_pred_linreg_pca)
# Create comparison dataframe
comparison_pca_r = pd.DataFrame({
'Model' : ['Linear Regression', 'XGBoost']
, 'MAE' : [mae_linreg_pca, mae_xgb_pca_r]
, 'MSE' : [mse_linreg_pca, mse_xgb_pca_r]
, 'RMSE' : [rmse_linreg_pca, rmse_xgb_pca_r]
, 'R² Score': [r2_linreg_pca, r2_xgb_pca_r]
})
print("=== PCA Regression Model Comparison ===")
print(comparison_pca_r.to_string(index=False))
fig, axes = plt.subplots(1, 2, figsize=(16, 5))
# Bar plot comparison for error metrics
error_metrics_pca = comparison_pca_r[['Model', 'MAE', 'RMSE']].set_index('Model')
error_metrics_pca.plot(kind='bar', ax=axes[0], rot=0, color=['skyblue', 'lightcoral'])
axes[0].set_title('Regression Error Metrics Comparison on PCA Data (Lower is Better)', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Error Value')
axes[0].legend(loc='upper right')
axes[0].grid(axis='y', alpha=0.3)
r2_comparison_pca = comparison_pca_r[['Model', 'R² Score']].set_index('Model')
r2_comparison_pca.plot(kind='bar', ax=axes[1], rot=0, color='lightgreen', legend=False)
axes[1].set_title('R² Score Comparison on PCA Data (Higher is Better)', fontsize=14, fontweight='bold')
axes[1].set_ylabel('R² Score')
axes[1].set_ylim([0, 1.0])
axes[1].grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
# Determine best model
best_model_pca_r2 = comparison_pca_r.loc[comparison_pca_r['R² Score'].idxmax(), 'Model']
best_model_pca_mae = comparison_pca_r.loc[comparison_pca_r['MAE'].idxmin(), 'Model']
print(f"\nBest model by R² Score: {best_model_pca_r2}")
print(f"Best model by MAE: {best_model_pca_mae}")
=== PCA Regression Model Comparison ===
Model MAE MSE RMSE R² Score
Linear Regression 26.963196 1359.150316 36.866656 0.609584
XGBoost 28.325929 1424.344999 37.740495 0.590856
Best model by R² Score: Linear Regression
Best model by MAE: Linear Regression
Interpretation: Linear Regression achieves slightly higher $R^2$ (0.6096 vs. 0.5909) and lower MAE than XGBoost, suggesting that the relationships between the PCA components and CREDIT_SCORE are predominantly linear. The difference is small, however, and both PCA-based models remain clearly worse than models trained on the full feature space.
df.shape
(1000, 87)
# 1. Define Features (X) and Targets (y)
y_classification = df['DEFAULT']
X_raw = df.drop(columns=['DEFAULT', 'CREDIT_SCORE', 'CREDIT_CATEGORY'])
# 2. Split Data for Classification
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
X_raw, y_classification, test_size=0.3, random_state=42, stratify=y_classification
)
# 3. Scaling the dataset
# StandardScaler puts all features on a comparable scale (zero mean, unit variance);
# if outliers were severe, RobustScaler would be a reasonable alternative
scaler = StandardScaler()
# Fit on training data ONLY
X_train_c_scaled = scaler.fit_transform(X_train_c)
# Transform test data using the training fit
X_test_c_scaled = scaler.transform(X_test_c)
# 4. Handle Imbalance with SMOTE
# SMOTE is applied ONLY to the training data
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_c_scaled, y_train_c)
print(f"Original training shape: {y_train_c.value_counts().to_dict()}")
print(f"Resampled training shape: {pd.Series(y_train_resampled).value_counts().to_dict()}")
Original training shape: {0: 501, 1: 199}
Resampled training shape: {1: 501, 0: 501}
# 1. Initialize Logistic Regression
log_smote = LogisticRegression(random_state=42, solver='liblinear')
log_smote.fit(X_train_resampled, y_train_resampled)
# 2. Evaluate Model using probabilities
y_pred_prob_smote = log_smote.predict_proba(X_test_c_scaled)[:, 1]
# 3. Calculate AUC-ROC
auc_smote = roc_auc_score(y_test_c, y_pred_prob_smote)
print("\n--- Classification Results (Logistic Regression + SMOTE) ---")
print(f"AUC-ROC Score on Test Set: {auc_smote:.4f}")
# 4. Generate Classification report
y_pred_smote = log_smote.predict(X_test_c_scaled)
print("\nClassification Report:")
print(classification_report(y_test_c, y_pred_smote))
--- Classification Results (Logistic Regression + SMOTE) ---
AUC-ROC Score on Test Set: 0.6260
Classification Report:
precision recall f1-score support
0 0.78 0.70 0.74 215
1 0.40 0.51 0.45 85
accuracy 0.65 300
macro avg 0.59 0.60 0.59 300
weighted avg 0.67 0.65 0.66 300
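The report above uses the default 0.5 decision threshold. Since the model outputs probabilities, the threshold itself is a tunable knob for trading precision against recall on the default class. A hedged sketch on synthetic data (stand-ins for the notebook's `log_smote`, `X_test_c_scaled`, and `y_test_c`):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.7, 0.3], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# Lowering the threshold below 0.5 catches more positives (higher recall)
# at the cost of precision; raising it does the opposite
for thr in (0.3, 0.5, 0.7):
    pred = (proba >= thr).astype(int)
    print(f"thr={thr}: precision={precision_score(y_te, pred):.2f}, "
          f"recall={recall_score(y_te, pred):.2f}")
```

In a credit-risk setting, where missing a defaulter is usually costlier than a false alarm, a threshold below 0.5 is often the more sensible operating point.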
# Train Random Forest
rf_smote = RandomForestClassifier(n_estimators=100, random_state=42)
rf_smote.fit(X_train_resampled, y_train_resampled)
# Y pred
y_rf = rf_smote.predict(X_test_c_scaled)
y_pred_probrf = rf_smote.predict_proba(X_test_c_scaled)[:, 1]
# Accuracy
acc = accuracy_score(y_test_c, y_rf)
print(f"Random Forest Accuracy: {acc:.4f}")
# AUC-ROC Score
auc_rf = roc_auc_score(y_test_c, y_pred_probrf)
print(f"Random Forest AUC-ROC Score: {auc_rf:.4f}")
print("\nClassification Report:")
print(classification_report(y_test_c, y_rf))
Random Forest Accuracy: 0.7067
Random Forest AUC-ROC Score: 0.6206
Classification Report:
precision recall f1-score support
0 0.74 0.84 0.79 215
1 0.40 0.27 0.32 85
accuracy 0.68 300
macro avg 0.57 0.55 0.55 300
weighted avg 0.65 0.68 0.66 300
This analysis evaluated classification (DEFAULT prediction) and regression (CREDIT_SCORE prediction) tasks using both the original features (84 dimensions after dropping the targets, including the engineered features) and PCA-reduced features (18 components capturing 90.4% of the variance). The study also assessed the effectiveness of three dimensionality reduction techniques: PCA, CCA, and FCA.
Classification (DEFAULT):
Best Model on Original Features: Logistic Regression
Best Model on PCA Features: Logistic Regression
Regression (CREDIT_SCORE):
Best Model on Original Features: Random Forest
Best Model on PCA Features: Linear Regression
PCA caused significant performance degradation:
Loss of predictive information: The 18 principal components failed to preserve crucial relationships between features.
Linear structure of credit scoring: Linear Regression achieved R² = 0.7553, nearly matching Random Forest, suggesting that credit scoring is largely a linear problem when features are properly engineered.
Original features essential: The full 84-dimensional feature space, including the engineered variables, is necessary for high-quality predictions.
Configuration:
Advantages:
Disadvantages:
Verdict:
Analysis: Financial Status (INCOME, SAVINGS, DEBT, ratios) vs. Spending Behavior (expenditure categories)
Results:
Insights Gained:
Practical Implications:
Verdict: Excellent technique for understanding multivariate relationships; provides actionable insights for feature engineering and model interpretation
Analysis 1: CREDIT_CATEGORY vs DEFAULT
| Category | FCA Coordinate | Interpretation |
|---|---|---|
| Excellent | -0.6298 | Strongly aligned with non-default |
| Good | -0.2184 | Slightly aligned with non-default |
| Fair | +0.0873 | Slightly leaning toward risk |
| Poor | +0.8486 | Strongly aligned with default |
Analysis 2: CAT_GAMBLING vs DEFAULT
| Gambling Level | FCA Coordinate | Interpretation |
|---|---|---|
| No (0) | -0.0361 | Aligned with non-default |
| Low (1) | +0.0011 | Neutral |
| High (2) | +0.0842 | Aligned with default risk |
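The FCA coordinates above come from correspondence analysis of a contingency table, and the mechanics can be reproduced from scratch: form the correspondence matrix, compute standardized residuals against the independence model, and take their SVD. A sketch with a hypothetical CREDIT_CATEGORY × DEFAULT table (the counts are illustrative, not the dataset's):

```python
import numpy as np
import pandas as pd

# Hypothetical CREDIT_CATEGORY x DEFAULT contingency table (illustrative counts)
table = pd.DataFrame(
    [[180, 10], [250, 40], [200, 80], [90, 150]],
    index=['Excellent', 'Good', 'Fair', 'Poor'],
    columns=['No Default', 'Default'],
)

P = table.values / table.values.sum()                # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)                  # row / column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals

# SVD of the residuals yields the principal axes;
# row principal coordinates scale the left singular vectors by the row masses
U, sv, Vt = np.linalg.svd(S, full_matrices=False)
row_coords = (U * sv) / np.sqrt(r)[:, None]

for cat, coord in zip(table.index, row_coords[:, 0]):
    print(f"{cat:>9}: {coord:+.4f}")
```

On the first axis, 'Excellent' and 'Poor' land at opposite ends with 'Fair' near the origin, reproducing the monotone risk gradient reported in the table above (the overall sign of an axis is arbitrary in SVD).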
Insights Gained:
Verdict: Highly effective for exploratory analysis of categorical variables; provides clear, interpretable insights that complement quantitative models